S5146: Data Movement Options for Scalable GPU Cluster Communication
Benjamin Klenk, PhD Student, Institute of Computer Engineering
Ruprecht-Karls University of Heidelberg, Germany
http://www.ziti.uni-heidelberg.de/ziti/en/ce-home
GTC 2015, San Jose, CA, US, 03/19/2015
CUDA Programming Model
▪ GPU Computing & CUDA
  • Thread hierarchy, shared memory, barrier
  • SIMT – Single Instruction, Multiple Threads
▪ Collaborative computing
  • Partitioning, divergence
  • Synchronization
▪ Collaborative memory accesses
  • Slackness to avoid large caching structures
  • Strong need for coalescing
  • Caching to reduce traffic on the memory bus
➔ What about communication?
[Diagram: compute and memory-access phases, the memory controller (MC), and the output data set]
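As an illustration of the collaborative pattern described above, here is a minimal CUDA kernel sketch (kernel name, array names, and the 256-thread block size are illustrative, not from the slides): threads perform coalesced global loads into shared memory, synchronize at a barrier, and cooperate on a block-wide reduction.

// Minimal sketch: coalesced loads, shared memory, barrier, cooperative reduction.
// Assumes the kernel is launched with 256 threads per block.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                      // shared memory per block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;  // coalesced global load
    __syncthreads();                                 // block-wide barrier
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                   // one partial sum per block
}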
A single GPU is barely enough…
▪ HPC demands unlimited computational power
▪ Workloads don't fit in memory
  • Graph computation
  • Deep learning
  • Molecular dynamics, astrophysics
▪ Deploy several GPUs
  • More FLOP/s
  • More GBs
➔ But: communication
➔ CUDA isn't enough
[Node diagram: GPU (1.4 TFLOP/s, 12 GB GDDR5 at 288 GB/s) and CPU (0.13 TFLOP/s, 64 GB DDR3 at 60 GB/s) connected over PCIe at 16 GB/s, with the NIC attached to the CPU at 16 GB/s and linked into the network fabric at 12 GB/s]
What am I going to talk about?
✦ What does communication currently look like?
✦ Problems with current models
✦ Introducing a global address space for GPUs
✦ Performance and energy measurements
Review: Messaging-based Communication
▪ MPI as de-facto standard
▪ CPU controls communication
▪ Put/Get
  • Memory registration
  • OS & driver interactions
▪ Work request generation
▪ Notification handling
  • Where to put them?
Review: One-sided Communication
[Timeline diagram: on each node GPU, CPU, and NIC are connected via PCIe; the CPU issues a work request, the NIC reads data from GPU memory and sends a network packet, the remote NIC writes the data into GPU memory, and completion notifications are raised on both sides; legend: CUDA stack, MPI stack, computation, possible overlap]
�6S5146 Data Movement Options for Scalable GPU Cluster Communication
B. Klenk, L. Oden, and H. Fröning, "Analyzing Put/Get APIs for Thread-Collaborative Processors," HUCAA Workshop in conjunction with ICPP, Minneapolis, MN, USA, 2014.
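To make the staged, CPU-controlled path in the diagram concrete, here is a minimal host-side sketch (buffer names, count, peer, and tag are illustrative): the sender copies GPU data into host memory and hands it to MPI, and the receiver reverses the steps.

// Sender: stage device data through host memory, then hand it to MPI.
cudaMemcpy(host_buf, dev_buf, count * sizeof(double), cudaMemcpyDeviceToHost);
MPI_Send(host_buf, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD);

// Receiver: receive into host memory, then copy back into GPU memory.
MPI_Recv(host_buf, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(dev_buf, host_buf, count * sizeof(double), cudaMemcpyHostToDevice);

Every hop (PCIe copy, MPI stack, NIC) is driven by the CPU, which is exactly the control overhead the timeline above depicts.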
The Problem in Numbers
▪ IB Verbs QDR: CPU vs. GPU
▪ GPUs are incompatible with messaging
  • Generating work requests
  • Registering memory
  • Polling on notifications
  • Controlling networking devices
▪ Bandwidth ~100x lower
▪ Kernel launch time equals a 32 kB data movement
See also: L. Oden, H. Fröning, F. J. Pfreundt, "InfiniBand Verbs on GPU: A Case Study of Controlling an InfiniBand Network Device from the GPU," ASHES Workshop at IPDPS 2014, to be published.
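As a rough illustration of why polling on notifications maps poorly onto a GPU, a device-side spin-wait sketch (the flag location and whoever updates it are assumptions, not part of the slides):

// A GPU thread busy-waits on a completion flag that the NIC or host is expected
// to update in GPU-visible memory; volatile forces the value to be re-read.
__device__ void wait_for_completion(volatile unsigned int *flag, unsigned int expected)
{
    while (*flag != expected) {
        // spin: burns SM cycles while waiting
    }
}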
GGAS – Global GPU Address Spaces
Reminder: everything in CUDA is thread-collaborative
Let’s get back to collaborative work
▪ GAS across GPUs
  • Address translation / target identification
  • Special hardware support required (NIC, EXTOLL)
▪ Severe limitations for full coherence and strong consistency
▪ Reverting to highly relaxed consistency models
Lena Oden and Holger Fröning, "GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters," IEEE International Conference on Cluster Computing 2013, September 23–27, 2013, Indianapolis, US.
EXTOLL
▪ HPC interconnection technology
▪ FPGA-based (Xilinx Virtex-6)
  • 157 MHz @ 64-bit datapaths
  • PCIe 2.0
  • 4 ports @ 16 Gb/s per direction
▪ ASIC in production
  • PCIe 3.0 (+ root port)
  • 6 + 1 ports @ 120 Gb/s per direction
▪ MPI, low-level API, open source
▪ SMFU: supports GGAS
Holger Fröning and Heiner Litz, "Efficient Hardware Support for the Partitioned Global Address Space," 10th Workshop on Communication Architecture for Clusters (CAC 2010), co-located with IPDPS 2010, April 19, 2010, Atlanta, Georgia.
www.extoll.de
GGAS – thread-collaborative, BSP-like communication
[Diagram: BSP-like phases on each GPU – computation, remote stores into the peer GPU's GDDR5, a global barrier, then continued computation]
double *remote = (double *) get_ptr_of(node);  // pointer into the remote GPU's memory
remote[tid] = data[tid];                       // remote store over the global address space
do_work();
ggas_barrier();                                // global barrier: stores are now visible
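A slightly fuller sketch of how the fragment above might sit inside a kernel; get_ptr_of(), do_work(), and ggas_barrier() stand in for the GGAS calls shown on the slide, and their exact signatures are assumptions.

// Illustrative only: each thread stores one element directly into the remote
// GPU's memory window, then all GPUs meet at a global barrier.
__global__ void ggas_exchange(const double *data, int node, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double *remote = (double *) get_ptr_of(node);   // peer GPU's GGAS window (assumed API)
    if (tid < n)
        remote[tid] = data[tid];                    // plain store lands in the remote GDDR5
    do_work();                                      // local computation can overlap
    ggas_barrier();                                 // global barrier: all stores visible
}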
GGAS – Bandwidth comparison
▪ MPI
  • CPU-controlled
  • cudaMemcpy D2H + MPI_Send
  • MPI_Recv + cudaMemcpy H2D
▪ GGAS
  • GPU-controlled, GDDR to GDDR
▪ RMA Direct
  • GPU-controlled, GDDR to GDDR
▪ RMA Host
  • CPU-controlled
  • cudaMemcpy D2H + RMA_Put
  • Get notification + cudaMemcpy H2D
Latency: GGAS ~2 µs, RMA ~5 µs, MPI ~10 µs
Analyzing Communication Models for Thread-parallel Processors in Terms of Energy and Time
How does GGAS compete with other methods?
Methodology
▪ High-performance computing: time to solution is the primary metric
▪ But: energy is becoming a dominating factor
▪ We measure time, but we also want to consider energy
▪ Power consumption needs to be determined, too
  • CPU, DRAM: Intel RAPL
  • GPU: NVIDIA NVML
▪ How do applications behave with regard to performance and energy?
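For the GPU side, a minimal sketch of reading board power with NVIDIA NVML (error handling omitted; device index 0 is an assumption); the RAPL counters for CPU and DRAM are read analogously on the host.

#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int mw;                       // power in milliwatts

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetPowerUsage(dev, &mw);     // instantaneous board power draw
    printf("GPU power: %.1f W\n", mw / 1000.0);
    nvmlShutdown();
    return 0;
}

Sampling this in a loop alongside the timed region yields power traces that can be integrated over time to obtain energy.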
Allreduce – Power and Energy Analysis
Lena Oden, Benjamin Klenk and Holger Fröning, "Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs," 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 26–29, 2014, Chicago, IL, US.
[Plot: accumulated energy consumption over time, GGAS vs. MPI]
Workload analysis / application performance
▪ 12 nodes (each with 2x Intel Ivy Bridge, NVIDIA K20, EXTOLL FPGA)
▪ Normalized to MPI (above 1: better performance; below 1: worse performance)
Benjamin Klenk, Lena Oden, Holger Fröning, "Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time," IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29–31, 2015.
Energy analysis
▪ Same cluster (12 nodes)
▪ Normalized to MPI
  • below 1: less energy
  • above 1: more energy
▪ GGAS: 25% less energy
▪ RMA: 20% less energy
▪ Why?
  • Less power: the CPU can sleep
  • Less execution time
Benjamin Klenk, Lena Oden, Holger Fröning, "Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time," IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29–31, 2015.
Conclusions
What have we learned?
Summary
▪ CPU-controlled communication (MPI incl. MVAPICH & CUDA-aware MPI; see the sketch after this list)
  • State of the art
  • Context/domain switches → additional overhead due to kernel launch latency
  • Additional data copies → GPUDirect RDMA only for small messages
  • Programming complexity increases → CUDA+MPI+X, X ∈ {OpenMP, pthreads, ...}
▪ GPU-controlled communication
  • Currently needs specialized hardware (e.g. EXTOLL)
  • Promising performance
  • Power consumption can be reduced by putting the CPU into sleep mode
  • In line with the CUDA programming model
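For contrast with the staged path shown earlier, a CUDA-aware MPI such as MVAPICH2-GDR accepts a device pointer directly, while control still stays on the CPU; a minimal sketch with illustrative names:

// The MPI library stages the data or uses GPUDirect RDMA internally; no explicit
// cudaMemcpy is needed, but the CPU still drives the transfer.
double *dev_buf;
cudaMalloc((void **) &dev_buf, count * sizeof(double));
MPI_Send(dev_buf, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD);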
Conclusion
▪ Post-Dennard scaling
▪ The communication/computation gap will increase dramatically in the future
➔ Heterogeneity in communication
▪ Abstractions and adaptivity minimize complexity
  • Hardware optimizations and software libraries to support efficient communication
  • Adaptive task models support dynamic application behavior
  • Hide architectural complexity
Specialized processors like GPUs require specialized communication models
Synergies: High Octane Project
▪ Communication-centric cluster
▪ 8 nodes
  • 16 Intel Ivy Bridge CPUs
  • 16 NVIDIA K20 GPUs (currently 8)
  • 16 EXTOLL NICs (currently 8)
▪ Put/Get, MPI & GGAS support
▪ Opportunity for other researchers, with various possible interactions
  • System-level software
  • Workloads from HPC and other domains
  • Compilers
  • Optimizations
[Diagram of one node: two CPUs with DDR3 memory, each attached to an NVIDIA GPU with GDDR5 and an EXTOLL NIC]
Credits
Thank you!
Contributions: Lena Oden (PhD student), Benjamin Klenk (PhD student), Alexander Matz (PhD student)
Discussions: Sudha Yalamanchili (Georgia Tech), Mark Hummel (NVIDIA)
Sponsoring: NVIDIA, Xilinx, German Excellence Initiative, Google
EXTOLL: Ulrich Brüning, Mondrian Nüssle and the complete team
Current main interactions
http://www.ziti.uni-heidelberg.de/ziti/en/ce-home