Exploiting Computation and Communication Overlap in MVAPICH2 and MVAPICH2-GDR MPI Libraries
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at the Overlapping Communication with Computation Symposium (April '18)
Parallel Programming Models Overview
[Diagram: three programming models, each shown with processes P1–P3]
• Shared Memory Model (single shared memory): SHMEM, DSM
• Distributed Memory Model (one memory per process): MPI (Message Passing Interface)
• Partitioned Global Address Space (PGAS) (logical shared memory over distributed memories): Global Arrays, UPC, Chapel, X10, CAF, …
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
[Diagram: middleware co-design stack]
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication Library or Runtime for Programming Models: Point-to-point Communication, Collective Communication, Energy-Awareness, Synchronization and Locks, I/O and File Systems, Fault Tolerance
• Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path); Multi-/Many-core Architectures; Accelerators (GPU and FPGA)
• Middleware co-design opportunities and challenges across the various layers: Performance, Scalability, Resilience
Basic Concept of Overlapping Communication with Computation
[Diagram: Application on top of MPI Runtime on top of Networking Technology]
• Design MPI primitives exploiting the overlap capabilities of network mechanisms
• Take advantage of overlap
– Transparently
– Through co-design
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,875 organizations in 86 countries
– More than 460,000 (> 0.46 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov '17 ranking)
• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 9th, 556,104 cores (Oakforest-PACS) in Japan
• 12th, 368,928-core (Stampede2) at TACC
• 17th, 241,108-core (Pleiades) at NASA
• 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
MVAPICH2 Release Timeline and Downloads
[Figure: cumulative number of downloads from the OSU site (0–500,000) from Sep-04 through Jan-18, annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3a, MV2-X 2.3b, MV2-Virt 2.2, MV2 2.3rc1, OSU INAM 0.9.3]
Architecture of MVAPICH2 Software Family
High Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
• Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis
Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path)
• Transport Protocols: RC, XRC, UD, DC
• Modern Features: UMR, ODP, SR-IOV, Multi-Rail
Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPower, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport Mechanisms: Shared Memory, CMA, IVSHMEM
• Modern Features: MCDRAM*, NVLink*, CAPI*, XPMEM* (* upcoming)
MVAPICH2 Software Family
High-Performance Parallel Programming Libraries
• MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt: High-performance and scalable MPI for hypervisor- and container-based HPC clouds
• MVAPICH2-EA: Energy-aware and high-performance MPI
• MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC
Microbenchmarks
• OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
• OSU INAM: Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT: Utility to measure the energy consumption of MPI applications
Presentation Outline
• MVAPICH2/MVAPICH2-X
– Job Startup
– Point-to-point Communication
– Remote Memory Access (RMA)
– Collective Communication
• MVAPICH2-GDR
– Support for InfiniBand Core-Direct
– GPU-kernel based Reduction
– Datatype Processing
• Deep Learning Application: OSU Caffe
Overlapping Application Compute with MPI Startup
[Diagram: timelines for processes P0–P3 under two schemes]
• Without overlap: MPI_Init performs Initialize HCA, Obtain Endpoint Address, and Exchange Addresses before returning; only then does the application Set Up Problem, Read Input Files, and Compute/Communicate. No overlap between MPI_Init and application computation.
• With overlap: the address exchange and other communication-independent tasks proceed in the background, so MPI can continue to initialize while the application starts setting up the problem and reading input files.
Towards High Performance and Scalable Startup at Exascale
• Near-constant MPI and OpenSHMEM initialization time at any process count
• 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
• Memory consumption for remote endpoint information reduced by O(processes per node)
• 1 GB of memory saved per node with 1M processes and 16 processes per node
[Figure: job startup performance vs. memory required to store endpoint information; P = PGAS state of the art, M = MPI state of the art, O = optimized PGAS/MPI; techniques: (a) on-demand connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, (e) shmem-based PMI]
(a) On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D. K. Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS '15)
(b) PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D. K. Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia '14)
(c, d) Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '15)
(e) SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D. K. Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '16)
Startup Performance on KNL + Omni-Path
[Figure: MPI_Init time (seconds) vs. number of processes (64–232K) on TACC Stampede-KNL, Intel MPI 2018 beta vs. MVAPICH2 2.3a; MPI_Init and Hello World time (seconds) vs. number of processes (64–64K) on Oakforest-PACS with MVAPICH2-2.3a]
• MPI_Init takes 51 seconds with 231,956 processes on 3,624 KNL nodes (Stampede at full scale)
• 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC)
• At 64K processes, MPI_Init and Hello World take 5.8s and 21s, respectively (Oakforest-PACS)
• All numbers reported with 64 processes per node
New designs available in MVAPICH2-2.3a and as a patch for SLURM-15.08.8 and SLURM-16.05.1
Process Management Interface (PMI) over Shared Memory (SHMEMPMI)
• SHMEMPMI allows MPI processes to directly read remote endpoint (EP) information from the process manager through shared memory segments
• Only a single copy per node – O(processes per node) reduction in memory usage
• Estimated savings of 1 GB per node with 1 million processes and 16 processes per node
• Up to 1,000 times faster PMI Gets compared to the default design
• Available since MVAPICH2 2.2rc1 and SLURM-15.08.8
[Figure: time taken by one PMI_Get (milliseconds) vs. processes per node (1–32), Default vs. SHMEMPMI, 16x actual speedup; memory usage for remote EP information per node (MB) vs. processes per job (16–1M) for Fence-Default, Allgather-Default, Fence-Shmem, and Allgather-Shmem, estimated 1000x reduction]
On-demand Connection Management for OpenSHMEM+MPI
[Figure: breakdown of OpenSHMEM startup time (Connection Setup, PMI Exchange, Memory Registration, Shared Memory Setup, Other) at 32–4K processes; OpenSHMEM initialization and Hello World time, static vs. on-demand, at 16–8K processes]
• Static connection establishment wastes memory and takes a lot of time
• On-demand connection management improves OpenSHMEM initialization time by 29.6 times
• Time taken for Hello World reduced by 8.31 times at 8,192 processes
• Available since MVAPICH2-X 2.1rc1
How to Get the Best Startup Performance with MVAPICH2?
• MV2_HOMOGENEOUS_CLUSTER=1 // set for homogeneous clusters
• MV2_ON_DEMAND_UD_INFO_EXCHANGE=1 // enable UD-based address exchange
Using SLURM as launcher
• Use PMI2
– ./configure --with-pm=slurm --with-pmi=pmi2
– srun --mpi=pmi2 ./a.out
• Use PMI Extensions
– Patch for SLURM available at http://mvapich.cse.ohio-state.edu/download/
– Patches available for SLURM 15, 16, and 17
– PMI Extensions are automatically detected by MVAPICH2
Using mpirun_rsh as launcher
• MV2_MT_DEGREE – degree of the hierarchical tree used by mpirun_rsh
• MV2_FASTSSH_THRESHOLD – #nodes beyond which the hierarchical-ssh scheme is used
• MV2_NPROCS_THRESHOLD – #nodes beyond which file-based communication is used for hierarchical-ssh
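For example (an illustrative launch line, not from the slides; the hostfile and process count are placeholders), such run-time parameters can be passed directly on the mpirun_rsh command line:
– mpirun_rsh -np 64 -hostfile hosts MV2_HOMOGENEOUS_CLUSTER=1 MV2_ON_DEMAND_UD_INFO_EXCHANGE=1 ./a.out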
Presentation Outline
• MVAPICH2/MVAPICH2-X
– Job Startup
– Point-to-point Communication
– Remote Memory Access (RMA)
– Collective Communication
• MVAPICH2-GDR
– Support for InfiniBand Core-Direct
– GPU-kernel based Reduction
– Datatype Processing
• Deep Learning Application: OSU Caffe
Communication Costs of Point-to-point Protocols - Eager
[Diagram: spanning the Application, MPI Library, and High-Performance Networks layers, the sender copies application data into pre-registered communication buffers (Buffer #1 … Buffer #n), the data crosses the network, and the receiver copies it out into application data; costs: memcpy at the sender, network transfer, memcpy at the receiver]
– Good communication performance for smaller messages
– No synchronization required between sender and receiver
– Cost of extra copies is high for large messages
Communication Costs of Point-to-point Protocols - Rendezvous
[Diagram: RTS/CTS handshake followed by a zero-copy data transfer; costs: half RTT for each control message plus the network transfer]
– Avoid extra copies for larger messages
– Synchronization required between sender and receiver
– Can be based on RDMA Read or RDMA Write (shown here)
Analyzing Overlap Potential of Eager Protocol
• Application processes schedule the communication operation
• Network adapter progresses communication in the background
• Application process is free to perform useful compute in the foreground
• Overlap of computation and communication => better overall application performance
• Increased buffer requirement
• Poor communication performance if used for all types of communication operations
[Diagram: sender and receiver processes schedule send/receive operations on their network interface cards, compute, then check for completion and find the operations complete; impact of changing the eager threshold on the performance of a multi-pair message-rate benchmark with 32 processes on Stampede]
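To make the overlap pattern concrete, here is a minimal sketch (not from the slides; COUNT, compute(), and the rank pairing are illustrative) of the nonblocking point-to-point exchange whose overlap behavior the eager and rendezvous protocols determine:

#include <mpi.h>

#define COUNT 4096

static void compute(void) { /* application work independent of the message */ }

int main(int argc, char **argv)
{
    double sbuf[COUNT] = {0}, rbuf[COUNT];
    MPI_Request req[2];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = rank ^ 1;  /* pair ranks 0-1, 2-3, ... (assumes an even process count) */

    MPI_Irecv(rbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);
    compute();  /* eager: the HCA can progress the transfer here; rendezvous: little progress */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}

Whether compute() actually overlaps the transfer depends on the protocol chosen for this message size, which is exactly what the eager-threshold tuning below controls.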
Analyzing Overlap Potential of Rendezvous Protocol
• Application processes schedule the communication operation
• Application process is free to perform useful compute in the foreground
• Little communication progress in the background
• All communication takes place at the final synchronization
• Reduced buffer requirement
• Good communication performance if used for large message sizes and for operations where the communication library is progressed frequently
• Poor overlap of computation and communication => poor overall application performance
[Diagram: sender and receiver schedule send/receive operations; the RTS and CTS handshake messages are exchanged while repeated completion checks return "not complete"; the transfer only completes at the final check]
Impact of Tuning Rendezvous Threshold on 3D-Stencil
[Figure: communication time (ms), overlap potential (%), and overall performance (ms) vs. message size (1 B – 256 KB), Default vs. Tuned; 8192 processes, SandyBridge + FDR]
• Increased eager threshold from 16KB to 512KB
• Very small degradation in raw communication performance
• Significant improvement in overlap of computation and communication
• ~18% improvement in overall performance
MV2_IBA_EAGER_THRESHOLD=512K
MV2_SMP_EAGERSIZE=512K
(Applicable to both InfiniBand and Omni-Path)
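As an illustrative usage sketch (the host list and executable name are placeholders, not from the slides), the tuned thresholds can be set per run as environment variables on the launch command:
– mpirun_rsh -np 8192 -hostfile hosts MV2_IBA_EAGER_THRESHOLD=512K MV2_SMP_EAGERSIZE=512K ./stencil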
Impact of Tuning Rendezvous Protocol on 3D-Stencil
[Figure: communication time (us), overlap potential (%), and overall performance (us) vs. message size, Default vs. Tuned; 64 processes, Broadwell + EDR]
• RDMA Read based protocol (RGET) used instead of RDMA Write
• Very minor penalty in raw performance
• Offers more overlap due to less synchronization overhead
• Up to 15% improvement in overall execution time
MV2_RNDV_PROTOCOL=RGET
(Applicable to InfiniBand)
Dynamic and Adaptive MPI Point-to-point Communication Protocols
• Different communication protocols have different trade-offs
– Need to consider performance, overlap, and memory requirements
– Manual tuning is difficult and time-consuming
• Can the MPI library select the best protocol at runtime?
– Use different protocols and thresholds between different pairs of processes
– Deliver good performance and minimize resource consumption
– Dynamically adapt to the application's communication requirements at runtime
Trade-offs of each design:
– Default: poor overlap, low memory requirement; low performance, high productivity
– Manually Tuned: good overlap, high memory requirement; high performance, low productivity
– Dynamic + Adaptive: good overlap, optimal memory requirement; high performance, high productivity
Dynamic and Adaptive MPI Point-to-point Communication Protocols (cont.)
Eager threshold for an example communication pattern (processes 0–3 on Node 1 communicating with processes 4–7 on Node 2) under different designs:
– Default: 16 KB for every pair
– Manually Tuned: 128 KB for every pair
– Dynamic + Adaptive: 32 KB, 64 KB, 128 KB, and 32 KB, matching each pair's desired eager threshold
Desired eager threshold per process pair: 0–4: 32 KB; 1–5: 64 KB; 2–6: 128 KB; 3–7: 32 KB
[Figure: execution time (seconds) and relative memory consumption of Amber at 128–1K processes for Default, Threshold=17K, Threshold=64K, Threshold=128K, and Dynamic Threshold]
H. Subramoni, S. Chakraborty, and D. K. Panda, Designing Dynamic and Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation and Communication, ISC '17 – Best Paper
Presentation Outline
• MVAPICH2/MVAPICH2-X
– Job Startup
– Point-to-point Communication
– Remote Memory Access (RMA)
– Collective Communication
• MVAPICH2-GDR
– Support for InfiniBand Core-Direct
– GPU-kernel based Reduction
– Datatype Processing
• Deep Learning Application: OSU Caffe
MPI-3 RMA: Communication and Synchronization Primitives
• Non-blocking one-sided communication routines
– Put, Get (Rput, Rget)
– Accumulate, Get_accumulate
– Atomics
• Flexible synchronization operations to control initiation and completion
MPI one-sided synchronization/completion primitives:
– Synchronization: Lock/Unlock, Lock_all/Unlock_all, Fence, Post-Wait/Start-Complete
– Completion: Win_sync, Flush, Flush_all, Flush_local, Flush_local_all
MVAPICH2 supports all RMA communication with best performance and overlap
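A minimal sketch (assumptions: an even number of ranks; COUNT and compute() are illustrative) of overlapping computation with a passive-target MPI_Put completed by MPI_Win_flush, the pattern evaluated on the next slide:

#include <mpi.h>

#define COUNT 1024

static void compute(void) { /* independent application work */ }

int main(int argc, char **argv)
{
    int rank, target;
    double *winbuf, origin[COUNT] = {0};
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    target = rank ^ 1;  /* pair up ranks (assumes an even process count) */

    MPI_Win_allocate(COUNT * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &winbuf, &win);

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Put(origin, COUNT, MPI_DOUBLE, target, 0, COUNT, MPI_DOUBLE, win);
    compute();                   /* the HCA can progress the RDMA write in the background */
    MPI_Win_flush(target, win);  /* the Put is complete at the target after this returns */
    MPI_Win_unlock(target, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}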
Overlap between Computation and RMA Operations
[Figure: overlap (%) vs. message size (1 B – 4 MB) for MPI_Put and MPI_Get with Lock/Unlock and MPI_Win_flush, MVAPICH2-2.3rc1; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, Mellanox ConnectX-4 EDR HCA, Mellanox OFED 4.3]
• 67-99% overlap between MPI_Put and computation
• 75-99% overlap between MPI_Get and computation
Graph Processing Framework with Optimized MPI RMA
• Proposed design performs better than the default implementation
• For Weakly Connected Components (WCC) on 256 cores, the proposed design reduces total execution time by 2X compared with the default scheme
[Figure: execution time (s) vs. #cores (64–256), Mizan-Default vs. Mizan-RMA-Opt; PageRank with LiveJournal1 (up to 3X) and WCC with LiveJournal1 (2X); lower is better]
M. Li, X. Lu, K. Hamidouche, J. Zhang, and D. K. Panda, "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," IEEE HiPC, 2016
Graph Processing Framework with Optimized MPI RMA
• Proposed design shows good strong scaling
• Proposed design scales better than the default implementation
[Figure: total execution time (s) at 256–640 processes, Mizan-Default vs. Mizan-RMA-Opt; PageRank with the Arabic dataset, up to 2.5X improvement; lower is better]
M. Li, X. Lu, K. Hamidouche, J. Zhang, and D. K. Panda, "Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA," IEEE HiPC, 2016
Presentation Outline
• MVAPICH2/MVAPICH2-X
– Job Startup
– Point-to-point Communication
– Remote Memory Access (RMA)
– Collective Communication
• MVAPICH2-GDR
– Support for InfiniBand Core-Direct
– GPU-kernel based Reduction
– Datatype Processing
• Deep Learning Application: OSU Caffe
Collective Communication in MVAPICH2
Blocking and non-blocking collective algorithms in MV2, designed for performance and overlap:
• Conventional (flat) and multi-/many-core aware designs
• Inter-node communication: point-to-point (SHMEM, LiMIC, CMA, XPMEM), hardware multicast, SHARP, RDMA
• Intra-node communication: point-to-point, direct shared memory, direct kernel-assisted (CMA, XPMEM, LiMIC)
Run-time flags:
• All shared-memory based collectives: MV2_USE_SHMEM_COLL (default: ON)
• Hardware multicast-based collectives: MV2_USE_MCAST (default: OFF)
• CMA-based collectives: MV2_USE_CMA_COLL (default: ON)
Hardware Multicast-aware MPI_Bcast on TACC Stampede
[Figure: MPI_Bcast latency (us), Default vs. Multicast: small messages (2–512 B, up to 80% improvement) and large messages (2K–128K, up to 85% improvement) at 102,400 cores; 16-byte and 32-KByte messages vs. number of nodes]
• MCAST-based designs improve the latency of MPI_Bcast by up to 85%
• Use MV2_USE_MCAST=1 to enable MCAST-based designs
Optimized CMA-based Collectives for Large Messages
[Figure: MPI_Gather latency (us) vs. message size (1K–4M) on KNL with 2 nodes/128 procs, 4 nodes/256 procs, and 8 nodes/512 procs (64 PPN), comparing MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and Tuned CMA; ~2.5x, ~3.2x, ~4x, and ~17x better]
• Significant improvement over existing implementations for Scatter/Gather with 1MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPower)
• New two-level algorithms for better scalability
• Improved performance for other collectives (Bcast, Allgather, and Alltoall)
S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, Best Paper Finalist
Available in MVAPICH2-X 2.3b
Shared Address Space (XPMEM)-based Collectives Design
[Figure: OSU_Allreduce and OSU_Reduce latency (us) vs. message size (16K–4M) on Broadwell with 256 procs, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-Opt; 1.8X for 4M Allreduce, 4X for 4M Reduce]
• "Shared address space"-based true zero-copy reduction collective designs in MVAPICH2
• Offloads computation/communication to peer ranks in reduction collective operations
• Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4M AllReduce
J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.
Will be available in a future release
Application-Level Benefits of XPMEM-Based Collectives
[Figure: execution time (s) vs. number of processes for IMPI-2017v1.132, MVAPICH2-2.3b, and MVAPICH2-Opt; CNTK AlexNet training (Broadwell, default batch size, 50 iterations, ppn=28) at 28–224 processes, up to 20% and 9%; MiniAMR (Broadwell, ppn=16) at 16–256 processes, up to 27% and 15%]
• Up to 20% benefit over IMPI for CNTK DNN training using AllReduce
• Up to 27% benefit over IMPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel
Problems with Blocking Collective Operations
[Diagram: four application processes enter the collective operation; computation halts while communication proceeds]
• Communication time cannot be used for compute
– No overlap of computation and communication
– Inefficient
Concept of Non-blocking Collectives
• Application processes schedule the collective operation
• Check periodically if the operation is complete
• Overlap of computation and communication => better performance
• Catch: who will progress the communication?
[Diagram: each application process schedules the operation with a communication support entity, computes, and periodically checks for completion]
Non-blocking Collective (NBC) Operations
• Enables overlap of computation with communication
• Non-blocking calls do not match blocking collective calls
– MPI may use different algorithms for blocking and non-blocking collectives
– Blocking collectives: optimized for latency
– Non-blocking collectives: optimized for overlap
• A process calling an NBC operation
– Schedules the collective operation and immediately returns
– Executes application computation code
– Waits for the end of the collective
• Communication is progressed by
– Application code through MPI_Test
– Network adapter (HCA) with hardware support
– Dedicated processes/threads in the MPI library
• There is a non-blocking equivalent for each blocking operation
– Has an "I" in the name: MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce
How do I write applications with NBC?
#include <mpi.h>
int main(int argc, char **argv)
{
    int nprocs, flag = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int sbuf[nprocs], rbuf[nprocs];           /* one element per peer (contents elided) */
    MPI_Ialltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, MPI_COMM_WORLD, &req);
    /* computation that does not depend on the result of the Alltoall */
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE); /* check if complete (non-blocking) */
    /* more computation that does not depend on the result of the Alltoall */
    MPI_Wait(&req, MPI_STATUS_IGNORE);        /* wait till complete (blocking) */
    MPI_Finalize();
    return 0;
}
P3DFFT Performance with Non-Blocking Alltoall using RDMA Primitives
[Figure: CPU time per loop (seconds) vs. number of processes; small-scale runs (128–512 procs) comparing Default, RDMA-Aware, and Default-Thread; large-scale runs (128–8K procs) comparing Default and RDMA-Aware]
• Weak scaling experiments; problem size increases with job size
• RDMA-Aware delivers 19% improvement over Default at 8,192 procs
• Default-Thread exhibits the worst performance
– Possibly because threads steal CPU cycles from P3DFFT
– Not considered for the large-scale experiments
Designing Non-Blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters, H. Subramoni, A. Awan, K. Hamidouche, D. Pekurovsky, A. Venkatesh, S. Chakraborty, K. Tomko, and D. K. Panda, ISC '15, Jul 2015
Will be available in a future release
Offloading with Scalable Hierarchical Aggregation Protocol (SHArP)
• Management and execution of MPI operations in the network using SHArP
– Manipulation of data while it is being transferred in the switch network
• SHArP provides an abstraction to realize the reduction operation
– Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
• AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
• Uses RC for communication between ANs and between ANs and hosts in the Aggregation Tree*
[Diagram: physical network topology vs. logical SHArP tree]
* Bloch et al., Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction
Evaluation of SHArP-based Non-Blocking Allreduce
MPI_Iallreduce benchmark, 1 PPN*, 8 nodes (*PPN: processes per node)
[Figure: pure communication latency (us, lower is better) and communication-computation overlap (%, higher is better) vs. message size (4–128 B), MVAPICH2 vs. MVAPICH2-SHArP; up to 2.3x higher overlap]
• Complete offload of the Allreduce collective operation to the switch enables much higher overlap of communication and computation
Available since MVAPICH2 2.3a
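A minimal sketch (buffer size and compute() are illustrative, not the benchmark source) of the MPI_Iallreduce overlap pattern being measured; with SHArP, the switch progresses the reduction while the host computes:

#include <mpi.h>

#define COUNT 16

static void compute(void) { /* work overlapped with the in-network reduction */ }

int main(int argc, char **argv)
{
    double in[COUNT] = {0}, out[COUNT];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Iallreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    compute();  /* with SHArP offload, the aggregation tree works in the background */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}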
Collective Offload in ConnectX-2, ConnectX-3, Connect-IB, ConnectX-4, and ConnectX-5
• Mellanox's ConnectX-2, ConnectX-3, Connect-IB, ConnectX-4, and ConnectX-5 adapters feature a "task-list" offload interface
– Extension to existing InfiniBand APIs
• Collective communication with 'blocking' behavior is usually a scaling bottleneck
– Matches the need for non-blocking collectives in MPI
• Accordingly, MPI software stacks need to be re-designed to leverage offload in a comprehensive manner
• Can applications be modified to take advantage of non-blocking collectives, and what will be the benefits?
Collective Offload Support in ConnectX InfiniBand Adapter (Recv followed by Multi-Send)
• Sender creates a task-list consisting of only send and wait WQEs
– One send WQE is created for each registered receiver and is appended to the rear of a singly linked task-list
– A wait WQE is added to make the ConnectX-2 HCA wait for the ACK packet from the receiver
[Diagram: the application posts a task list (Send, Wait, Send, Send, Send, Wait) to the HCA's send/receive queues, management queue (MQ), and completion queues (Send CQ, Recv CQ, MCQ) before data crosses the physical link]
Co-designing HPL with Core-Direct and Performance Benefits
[Figure: normalized HPL performance vs. problem size N as % of total memory (10–70%) with 512 processes, HPL-Offload vs. HPL-1ring vs. HPL-Host, up to 4.5% gain; throughput (GFlops) and memory consumption (%) vs. system size (64–512 processes)]
• HPL-Offload consistently offers higher throughput than HPL-1ring and HPL-Host; improves peak throughput by up to 4.5% for large problem sizes
• HPL-Offload surpasses the peak throughput of HPL-1ring with significantly smaller problem sizes and run-times
K. Kandalla, H. Subramoni, J. Vienne, S. Pai Raikar, K. Tomko, S. Sur, and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HOTI 2011
Pre-conditioned Conjugate Gradient (PCG) Solver Performance with Non-Blocking Allreduce based on CX-2 Collective Offload
[Figure: run-time (s) vs. number of processes (64–512), PCG-Default vs. Modified-PCG-Offload]
• 64,000 unknowns per process; the modified PCG with Offload-Allreduce performs 21.8% better than the default PCG
K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne, and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12, May 2012.
Presentation Outline
• MVAPICH2/MVAPICH2-X
– Job Startup
– Point-to-point Communication
– Remote Memory Access (RMA)
– Collective Communication
• MVAPICH2-GDR
– Support for InfiniBand Core-Direct
– GPU-kernel based Reduction
– Datatype Processing
• Deep Learning Application: OSU Caffe
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers
• High performance and high productivity
At sender: MPI_Send(s_devbuf, size, …);
At receiver: MPI_Recv(r_devbuf, size, …);
(device pointers handled inside MVAPICH2)
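A minimal sketch (message size and the two-rank exchange are illustrative, not from the slides) of what CUDA-aware MPI permits: device pointers passed directly to standard MPI calls. Run under MVAPICH2-GDR with MV2_USE_CUDA=1 (as in the HOOMD-blue run-time settings later in this talk):

#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    int rank;
    float *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&devbuf, N * sizeof(float));
    /* Device pointer passed straight to MPI; the library moves the data
     * (GPUDirect RDMA / pipelining) without an explicit cudaMemcpy. */
    if (rank == 0)
        MPI_Send(devbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}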
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU nodes (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory
GPU-Direct RDMA (GDR) with CUDA
• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect RDMA
– Hybrid design using GPUDirect RDMA
• GPUDirect RDMA and host-based pipelining
• Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
• Similar bottlenecks on Haswell
– Support for communication using multi-rail
– Support for Mellanox Connect-IB and ConnectX VPI adapters
– Support for RoCE with Mellanox ConnectX VPI adapters
[Diagram: CPU, chipset, system memory, GPU, GPU memory, and IB adapter; the GDR path connects GPU memory to the IB adapter]
P2P bandwidth (SNB E5-2670 / IVB E5-2680V2):
– P2P read: intra-socket <1.0 GB/s (SNB) and 3.5 GB/s (IVB); inter-socket <300 MB/s on both
– P2P write: intra-socket 5.2 GB/s (SNB) and 6.4 GB/s (IVB); inter-socket <300 MB/s on both
Optimized MVAPICH2-GDR Design
[Figure: GPU-GPU inter-node latency (us), bandwidth (MB/s), and bi-bandwidth (MB/s) vs. message size, MV2-(NO-GDR) vs. MV2-GDR-2.3a; 1.88 us small-message latency (11X better), 9x bandwidth, 10x bi-bandwidth]
MVAPICH2-GDR-2.3a; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPUDirect RDMA
Overlap with Optimized MVAPICH2-GDR Design
[Figure: overlap (%) vs. message size (1 B – 1 MB) for GPU-GPU intra-node transfers (MVAPICH2-GDR-2.3a) and inter-node transfers (MVAPICH2-(NO-GDR) vs. MVAPICH2-GDR-2.3a)]
• Up to 69% overlap* for intra-node GPU-GPU communication
• With GDR, up to 78% overlap* for inter-node small and medium message transfers
• With intelligent pipelining, up to 88% overlap* for inter-node large message transfers
*Overlap between GPU-to-GPU communication and CPU computation
MVAPICH2-GDR-2.3a; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPUDirect RDMA
Application-Level Evaluation (HOOMD-blue)
[Figure: average time steps per second (TPS) vs. number of processes (4–32), MV2 vs. MV2+GDR, for 64K and 256K particles; 2X improvement]
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Figure: normalized execution time vs. number of GPUs for Default, Callback-based, and Event-based designs on the Wilkes GPU cluster (4–32 GPUs) and the CSCS GPU cluster (16–96 GPUs)]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Presentation Outline
• MVAPICH2/MVAPICH2-X
– Job Startup
– Point-to-point Communication
– Remote Memory Access (RMA)
– Collective Communication
• MVAPICH2-GDR
– Support for InfiniBand CORE-Direct
– GPU-kernel based Reduction
– Datatype Processing
• Deep Learning Application: OSU Caffe
Motivation: Exploiting CORE-Direct and GPUDirect RDMA
• Applications use GPU/CPU resources for computation and MPI for communication directly from GPU buffers
• MPI collectives are common in GPU applications, e.g., Alltoall for FFTs
• Collectives become time consuming at scale, so MPI-3.0 introduced NBCs
• Non-blocking communication operations from GPU buffers can
– Allow the CPU to overlap GPU-based communication with CPU compute
– Spare GPU kernels from redundantly waiting on non-dependent communication
– Allow power-efficient execution from the CPU's perspective
• A rich set of GPU and network primitives is available for NBC designs, but architectural limitations must be addressed
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC '15
Overview of Core-Direct + GPUDirect Designs
• Realized through mapping of the MPICH schedule abstraction
– Schedule composed of sched_send, sched_barrier, sched_recv, sched_start, etc.
– Mapped to Core-Direct primitives, with collective-specific GPU↔Host movement done additionally
• Multiple designs explored
– Naïve Design: host-assisted GPU NBC (Scatter)
– Offload-Staged: host-assisted GPU NBC + Core-Direct
– Offload-GDR: (GDR + Core-Direct)-based NBC
– Offload-Callback: (Core-Direct, GDR, CUDA)-based NBC
Latency Comparison with Blocking Collectives
[Figure: 64-node Iallgather and Ialltoall latency (us) vs. message size (4K–1MB) for gdr-offload-cb, staged-offload, offload-gdr, and Blocking]
• Use of GDR and CUDA callback mechanisms improves latency (comparable for Alltoall)
• Latency remains high for Alltoall even though the callback design avoids staging latency
Effect of Compute Location on Overlap/Latency
[Figure: 64-node Iallgather overlap (%) and latency (us) vs. message size (4K–256K) for gdr-offload-cb-GPU, gdr-offload-cb-CPU, offload-gdr-GPU, and offload-gdr-CPU]
• The new schemes are able to exploit overlap well
Available in MVAPICH2-GDR 2.3a
Presentation Outline
• MVAPICH2/MVAPICH2-X
– Job Startup
– Point-to-point Communication
– Remote Memory Access (RMA)
– Collective Communication
• MVAPICH2-GDR
– Support for InfiniBand CORE-Direct
– GPU-kernel based Reduction
– Datatype Processing
• Deep Learning Application: OSU Caffe
GPU-kernel based Reduction
• Scientific parallel applications spend a considerable amount of time in GPU-based collective communication operations
– E.g., deep learning frameworks such as TensorFlow and Caffe
• Optimized computation-intensive collectives in MVAPICH2-GDR
– MPI_Reduce and MPI_Allreduce
– Exploring the best combinations: computation on CPU or GPU; communication through host or GPU memory
[Diagram: Node A and Node B, each with CPU, host memory, GPU, PCIe, and IB adapter; numbered data paths through host or GPU memory]
Evaluation - MPI_Reduce @ CSCS (96 GPUs)
[Figure: latency (us) vs. message size for Default, BD-DD, GR-DD, GR-HD, GR-HH, and GR-H-HH; small messages (4 B – 8K) and large messages (16K – 4M)]
• Gather-first approaches* win for small messages
• K-nomial GPU-based approaches* win for large messages
*Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, and Dhabaleswar K. Panda, "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters," IEEE/ACM CCGrid '16.
Evaluation - MPI_Allreduce
[Figure: latency (ms) vs. message size at 96 GPUs @ CSCS, and latency (ms) vs. system size (2–32 nodes) on 32 GPU nodes @ Wilkes, for Default, RD-DD, and BRB-DD; good scalability]
Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, and Dhabaleswar K. Panda, "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters," IEEE/ACM CCGrid '16.
Available in MVAPICH2-GDR 2.3a
Presentation Outline
• MVAPICH2/MVAPICH2-X
– Job Startup
– Point-to-point Communication
– Remote Memory Access (RMA)
– Collective Communication
• MVAPICH2-GDR
– Support for InfiniBand CORE-Direct
– GPU-kernel based Reduction
– Datatype Processing
• Deep Learning Application: OSU Caffe
Non-contiguous Data Exchange
• Multi-dimensional data
– Row-based organization
– Contiguous in one dimension
– Non-contiguous in the other dimensions
• Halo data exchange
– Duplicate the boundary
– Exchange the boundary in each iteration
[Diagram: halo data exchange between neighboring sub-domains]
MPI Datatype Support in MVAPICH2
• Datatype support in MPI
– Operate on customized datatypes to improve productivity
– Enable the MPI library to optimize non-contiguous data
At sender:
MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
MPI_Type_commit(&new_type);
…
MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
• Inside MVAPICH2
– Uses datatype-specific CUDA kernels to pack data in chunks
– Efficiently moves data between nodes using RDMA
– In progress: currently optimizes vector and hindexed datatypes
– Transparent to the user
H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
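A small self-contained sketch (dimensions and the column exchange are illustrative, not from the slides) of the vector-datatype pattern above: sending one non-contiguous column of a row-major matrix with a single MPI_Send:

#include <mpi.h>

#define ROWS 8
#define COLS 8

int main(int argc, char **argv)
{
    double grid[ROWS][COLS] = {{0}};
    MPI_Datatype column;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ROWS blocks of 1 element with stride COLS: one column of a row-major array */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)       /* send rank 0's last column into rank 1's first column */
        MPI_Send(&grid[0][COLS - 1], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&grid[0][0], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}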
MPI Datatype Processing (Computation Optimization)
• Comprehensive support
– Targeted kernels for regular datatypes: vector, subarray, indexed_block
– Generic kernels for all other, irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
– Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
– Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
– Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
– Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM
MPI Datatype Processing (Communication Optimization)
Common scenario – waste of computing resources on CPU and GPU:
MPI_Isend(Buf1, ..., req1);
MPI_Isend(Buf2, ..., req2);
Application work on the CPU/GPU
MPI_Waitall(requests, …)
*Buf1, Buf2, … contain non-contiguous MPI datatypes
MPI Datatype Processing (Communication Optimization)
• Modified 'CUDA-Aware' DDTBench for NAS_MG_y
– Up to 90% overlap between datatype processing and other computation
[Figure: overlap (%) for input sizes 32x16x16, 128x64x64, 256x128x128, and 512x256x256 with Default, Event-based, and Callback-based designs]
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
Available in MVAPICH2-GDR 2.3a
Presentation Outline
• MVAPICH2/MVAPICH2-X
– Job Startup
– Point-to-point Communication
– Remote Memory Access (RMA)
– Collective Communication
• MVAPICH2-GDR
– Support for InfiniBand CORE-Direct
– GPU-kernel based Reduction
– Datatype Processing
• Deep Learning Application: OSU Caffe
Deep Learning: New Challenges for MPI Runtimes
• Deep learning frameworks are a different game altogether
– Unusually large message sizes (on the order of megabytes)
– Most communication based on GPU buffers
• Existing state of the art
– cuDNN, cuBLAS, NCCL --> scale-up performance
– NCCL2, CUDA-Aware MPI --> scale-out performance, for small and medium message sizes only!
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
– Efficient overlap of computation and communication
– Efficient large-message communication (reductions)
– What application co-designs are needed to exploit communication-runtime co-designs?
[Diagram: scale-up vs. scale-out performance; cuDNN, cuBLAS, and NCCL favor scale-up; gRPC, Hadoop, and MPI favor scale-out; NCCL2 and the proposed co-designs target both]
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
OSU-Caffe: Proposed Co-Design Overview
• To address the limitations of Caffe and existing MPI runtimes, we propose the OSU-Caffe (S-Caffe) framework
• At the application (DL framework) level
– Develop a fine-grain workflow, i.e., layer-wise communication instead of communicating the entire model
• At the runtime (MPI) level
– Develop support to perform reductions of very large GPU buffers
– Perform reductions using GPU kernels
OSU-Caffe is available from the HiDL project page (http://hidl.cse.ohio-state.edu)
Optimized Data Propagation and Gradient Aggregation using NBC Designs
• Exploit non-blocking collective (NBC) operations in MPI-3 (see the sketch after this list)
– Divide communication into fine-grain steps
– Overlap computation of layer "i" with communication of layer "i+1"
– MPI_Ibcast to post all communication in advance
• Wait in an on-demand fashion
– Allow for runtime selection of the data propagation design
• Based on message (DL model) size, number of GPUs, and number of nodes
• Co-design gradient aggregation at the application level
– Helper-thread based approach to realize a non-blocking MPI_Reduce
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, PPoPP '17
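A minimal sketch (layer count, parameter sizes, and compute_forward() are illustrative, not S-Caffe source) of the layer-wise scheme: post one MPI_Ibcast per layer in advance, then wait on each layer's parameters only when that layer is reached, so layer i+1's communication overlaps layer i's computation:

#include <mpi.h>

#define LAYERS 8
#define PARAMS 1024

static void compute_forward(int layer) { /* work on this layer's activations */ }

int main(int argc, char **argv)
{
    static float params[LAYERS][PARAMS];  /* root (rank 0) holds the model */
    MPI_Request req[LAYERS];

    MPI_Init(&argc, &argv);

    /* Post all broadcasts up front, one per layer */
    for (int l = 0; l < LAYERS; l++)
        MPI_Ibcast(params[l], PARAMS, MPI_FLOAT, 0, MPI_COMM_WORLD, &req[l]);

    /* Wait on-demand: layer l's parameters are needed only when layer l runs */
    for (int l = 0; l < LAYERS; l++) {
        MPI_Wait(&req[l], MPI_STATUS_IGNORE);
        compute_forward(l);
    }

    MPI_Finalize();
    return 0;
}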
S-Caffe vs. Inspur-Caffe and Microsoft CNTK
• AlexNet: notoriously hard to scale out on multiple nodes due to communication overhead!
– Large number of parameters, ~64 million (comm. buffer size = 256 MB)
– Up to 14% improvement (scale-up)
• GoogLeNet is a popular DNN
– 13 million parameters (comm. buffer size = ~50 MB)
– Impact of HR
S-Caffe delivers better or comparable performance with other multi-node capable DL frameworks
Concluding Remarks
• Exploiting overlap between computation and communication is significant in HPC
• Presented some of the approaches and results along these directions taken by the MVAPICH2 and MVAPICH2-GDR libraries
• These designs allow applications to take advantage of the overlap capabilities
• As exascale systems grow more complicated in their architectures, solutions exploiting overlap capabilities will be increasingly important
Additional Presentation and Tutorials
• Tutorial: How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries
– 04/05/18, 9:00 am – 12:00 noon
• Tutorial: High Performance Distributed Deep Learning: A Beginner's Guide
– 04/05/18, 1:00 pm – 4:00 pm
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.), M. Bayatpour (Ph.D.), R. Biswas (M.S.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), H. Javed (Ph.D.), P. Kousha (Ph.D.), D. Shankar (Ph.D.), H. Shi (Ph.D.), J. Zhang (Ph.D.)
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
Current Research Scientists
– X. Lu, H. Subramoni
Current Post-docs
– A. Ruhela, K. Manian
Current Research Specialists
– J. Smith, M. Arnold
Past Students
– A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), M. Li (Ph.D.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientists
– K. Hamidouche, S. Sur
Past Post-Docs
– D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Past Programmers
– D. Bureddy, J. Perkins
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/