S5146: Data Movement Options for Scalable GPU Cluster Communication
Benjamin Klenk, PhD Student, Institute of Computer Engineering
Ruprecht-Karls University of Heidelberg, Germany
http://www.ziti.uni-heidelberg.de/ziti/en/ce-home
GTC 2015, San Jose, CA, US, 03/19/2015
CUDA Programming Model
▪ GPU Computing & CUDA
  • Thread hierarchy, shared memory, barrier
  • SIMT – Single Instruction, Multiple Threads
▪ Collaborative computing
  • Partitioning, divergence
  • Synchronization
▪ Collaborative memory accesses
  • Slackness to avoid large caching structures
  • Strong need for coalescing
  • Caching to reduce traffic on the memory bus
➔ What about communication?
[Diagram: compute and memory-access phases, the memory controller (MC), and the output data set]
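As an illustration of the collaborative pattern described above, here is a minimal CUDA kernel sketch (kernel name, array names, and the 256-thread block size are illustrative, not from the slides): threads perform coalesced global loads into shared memory, synchronize at a barrier, and cooperate on a block-wide reduction.

// Minimal sketch: coalesced loads, shared memory, barrier, cooperative reduction.
// Assumes the kernel is launched with 256 threads per block.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                      // shared memory per block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;  // coalesced global load
    __syncthreads();                                 // block-wide barrier
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                   // one partial sum per block
}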
A single GPU is barely enough…
▪ HPC demands unlimited computational power
▪ Workloads don't fit in memory
  • Graph computation
  • Deep learning
  • Molecular dynamics, astrophysics
▪ Deploy several GPUs
  • More FLOP/s
  • More GBs
➔ But: communication
➔ CUDA isn't enough
[Node diagram: GPU (1.4 TFLOP/s, 12 GB GDDR5 at 288 GB/s) and CPU (0.13 TFLOP/s, 64 GB DDR3 at 60 GB/s) connected over PCIe at 16 GB/s, with the NIC attached to the CPU at 16 GB/s and linked into the network fabric at 12 GB/s]
What am I going to talk about?
✦ What does communication currently look like?
✦ Problems with current models
✦ Introducing a global address space for GPUs
✦ Performance and energy measurements
Review: Messaging-based Communication
▪ MPI as de-facto standard
▪ CPU controls communication
▪ Put/Get
  • Memory registration
  • OS & driver interactions
▪ Work request generation
▪ Notification handling
  • Where to put them?
Review: One-sided Communication
[Timeline diagram: on each node GPU, CPU, and NIC are connected via PCIe; the CPU issues a work request, the NIC reads data from GPU memory and sends a network packet, the remote NIC writes the data into GPU memory, and completion notifications are raised on both sides; legend: CUDA stack, MPI stack, computation, possible overlap]
�6S5146 Data Movement Options for Scalable GPU Cluster Communication
B. Klenk, L. Oden, and H. Fröning, "Analyzing Put/Get APIs for Thread-Collaborative Processors," HUCAA Workshop in conjunction with ICPP, Minneapolis, MN, USA, 2014.
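To make the staged, CPU-controlled path in the diagram concrete, here is a minimal host-side sketch (buffer names, count, peer, and tag are illustrative): the sender copies GPU data into host memory and hands it to MPI, and the receiver reverses the steps.

// Sender: stage device data through host memory, then hand it to MPI.
cudaMemcpy(host_buf, dev_buf, count * sizeof(double), cudaMemcpyDeviceToHost);
MPI_Send(host_buf, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD);

// Receiver: receive into host memory, then copy back into GPU memory.
MPI_Recv(host_buf, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(dev_buf, host_buf, count * sizeof(double), cudaMemcpyHostToDevice);

Every hop (PCIe copy, MPI stack, NIC) is driven by the CPU, which is exactly the control overhead the timeline above depicts.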
The Problem in Numbers
▪ IB Verbs QDR: CPU vs. GPU
▪ GPUs are incompatible with messaging
  • Generating work requests
  • Registering memory
  • Polling on notifications
  • Controlling networking devices
▪ Bandwidth ~100x lower
▪ Kernel launch time equals a 32 kB data movement
See also: L. Oden, H. Fröning, F. J. Pfreundt, "InfiniBand Verbs on GPU: A Case Study of Controlling an InfiniBand Network Device from the GPU," ASHES Workshop at IPDPS 2014, to be published.
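As a rough illustration of why polling on notifications maps poorly onto a GPU, a device-side spin-wait sketch (the flag location and whoever updates it are assumptions, not part of the slides):

// A GPU thread busy-waits on a completion flag that the NIC or host is expected
// to update in GPU-visible memory; volatile forces the value to be re-read.
__device__ void wait_for_completion(volatile unsigned int *flag, unsigned int expected)
{
    while (*flag != expected) {
        // spin: burns SM cycles while waiting
    }
}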
GGAS – Global GPU Address Spaces
Reminder: everything in CUDA is thread-collaborative
Let’s get back to collaborative work
▪ GAS across GPUs
  • Address translation / target identification
  • Special hardware support required (NIC, EXTOLL)
▪ Severe limitations for full coherence and strong consistency
▪ Reverting to highly relaxed consistency models
Lena Oden and Holger Fröning, "GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters," IEEE International Conference on Cluster Computing 2013, September 23–27, 2013, Indianapolis, US.
EXTOLL
▪ HPC interconnection technology
▪ FPGA-based (Xilinx Virtex-6)
  • 157 MHz @ 64-bit datapaths
  • PCIe 2.0
  • 4 ports @ 16 Gb/s per direction
▪ ASIC in production
  • PCIe 3.0 (+ root port)
  • 6 + 1 ports @ 120 Gb/s per direction
▪ MPI, low-level API, open source
▪ SMFU: supports GGAS
Holger Fröning and Heiner Litz, "Efficient Hardware Support for the Partitioned Global Address Space," 10th Workshop on Communication Architecture for Clusters (CAC 2010), co-located with IPDPS 2010, April 19, 2010, Atlanta, Georgia.
www.extoll.de
GGAS – thread-collaborative, BSP-like communication
[Diagram: BSP-like phases on each GPU – computation, remote stores into the peer GPU's GDDR5, a global barrier, then continued computation]
double *remote = (double *) get_ptr_of(node);  // pointer into the remote GPU's memory
remote[tid] = data[tid];                       // remote store over the global address space
do_work();
ggas_barrier();                                // global barrier: stores are now visible
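A slightly fuller sketch of how the fragment above might sit inside a kernel; get_ptr_of(), do_work(), and ggas_barrier() stand in for the GGAS calls shown on the slide, and their exact signatures are assumptions.

// Illustrative only: each thread stores one element directly into the remote
// GPU's memory window, then all GPUs meet at a global barrier.
__global__ void ggas_exchange(const double *data, int node, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double *remote = (double *) get_ptr_of(node);   // peer GPU's GGAS window (assumed API)
    if (tid < n)
        remote[tid] = data[tid];                    // plain store lands in the remote GDDR5
    do_work();                                      // local computation can overlap
    ggas_barrier();                                 // global barrier: all stores visible
}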
GGAS – Bandwidth comparison
▪ MPI
  • CPU-controlled
  • cudaMemcpy D2H + MPI_Send
  • MPI_Recv + cudaMemcpy H2D
▪ GGAS
  • GPU-controlled, GDDR to GDDR
▪ RMA Direct
  • GPU-controlled, GDDR to GDDR
▪ RMA Host
  • CPU-controlled
  • cudaMemcpy D2H + RMA_Put
  • Get notification + cudaMemcpy H2D
Latency: GGAS ~2 µs, RMA ~5 µs, MPI ~10 µs
Analyzing Communication Models for Thread-parallel Processors in Terms of Energy and Time
How does GGAS compete with other methods?
Methodology
▪ High-performance computing: time to solution is the primary metric
▪ But: energy is becoming a dominating factor
▪ We measure time, but we also want to consider energy
▪ Power consumption needs to be determined, too
  • CPU, DRAM: Intel RAPL
  • GPU: NVIDIA NVML
▪ How do applications behave with regard to performance and energy?
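For the GPU side, a minimal sketch of reading board power with NVIDIA NVML (error handling omitted; device index 0 is an assumption); the RAPL counters for CPU and DRAM are read analogously on the host.

#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int mw;                       // power in milliwatts

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetPowerUsage(dev, &mw);     // instantaneous board power draw
    printf("GPU power: %.1f W\n", mw / 1000.0);
    nvmlShutdown();
    return 0;
}

Sampling this in a loop alongside the timed region yields power traces that can be integrated over time to obtain energy.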
Allreduce – Power and Energy Analysis
Lena Oden, Benjamin Klenk and Holger Fröning, "Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs," 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 26–29, 2014, Chicago, IL, US.
[Plot: accumulated energy consumption over time, GGAS vs. MPI]
Workload analysis / application performance
▪ 12 nodes (each with 2x Intel Ivy Bridge, NVIDIA K20, EXTOLL FPGA)
▪ Normalized to MPI (above 1: better performance; below 1: worse performance)
Benjamin Klenk, Lena Oden, Holger Fröning, "Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time," IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29–31, 2015.
Energy analysis
▪ Same cluster (12 nodes)
▪ Normalized to MPI
  • below 1: less energy
  • above 1: more energy
▪ GGAS: 25% less energy
▪ RMA: 20% less energy
▪ Why?
  • Less power: the CPU can sleep
  • Less execution time
Benjamin Klenk, Lena Oden, Holger Fröning, "Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time," IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29–31, 2015.
Conclusions
What have we learned?
Summary
▪ CPU-controlled communication (MPI incl. MVAPICH & CUDA-aware MPI; see the sketch after this list)
  • State of the art
  • Context/domain switches → additional overhead due to kernel launch latency
  • Additional data copies → GPUDirect RDMA only for small messages
  • Programming complexity increases → CUDA+MPI+X, X ∈ {OpenMP, pthreads, ...}
▪ GPU-controlled communication
  • Currently needs specialized hardware (e.g. EXTOLL)
  • Promising performance
  • Power consumption can be reduced by putting the CPU into sleep mode
  • In line with the CUDA programming model
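For contrast with the staged path shown earlier, a CUDA-aware MPI such as MVAPICH2-GDR accepts a device pointer directly, while control still stays on the CPU; a minimal sketch with illustrative names:

// The MPI library stages the data or uses GPUDirect RDMA internally; no explicit
// cudaMemcpy is needed, but the CPU still drives the transfer.
double *dev_buf;
cudaMalloc((void **) &dev_buf, count * sizeof(double));
MPI_Send(dev_buf, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD);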
Conclusion
▪ Post-Dennard scaling
▪ The communication/computation gap will increase dramatically in the future
➔ Heterogeneity in communication
▪ Abstractions and adaptivity minimize complexity
  • Hardware optimizations and software libraries to support efficient communication
  • Adaptive task models support dynamic application behavior
  • Hide architectural complexity
Specialized processors like GPUs require specialized communication models
Synergies: High Octane Project
▪ Communication-centric cluster
▪ 8 nodes
  • 16 Intel Ivy Bridge CPUs
  • 16 NVIDIA K20 GPUs (currently 8)
  • 16 EXTOLL NICs (currently 8)
▪ Put/Get, MPI & GGAS support
▪ Opportunity for other researchers, with various possible interactions
  • System-level software
  • Workloads from HPC and other domains
  • Compilers
  • Optimizations
[Diagram of one node: two CPUs with DDR3 memory, each attached to an NVIDIA GPU with GDDR5 and an EXTOLL NIC]
Credits
Thank you!
Contributions: Lena Oden (PhD student), Benjamin Klenk (PhD student), Alexander Matz (PhD student)
Discussions: Sudha Yalamanchili (Georgia Tech), Mark Hummel (NVIDIA)
Sponsoring: NVIDIA, Xilinx, German Excellence Initiative, Google
EXTOLL: Ulrich Brüning, Mondrian Nüssle and the complete team
Current main interactions
http://www.ziti.uni-heidelberg.de/ziti/en/ce-home