Breaking the latency barrier: Strong scaling LQCD on GPUs
!2
QUDA
• “QCD on CUDA” – http://lattice.github.com/quda (open source, BSD license)
• Effort started at Boston University in 2008, now in wide use as the GPU backend for BQCD, Chroma, CPS, MILC, TIFR, etc.
• Provides:
— Various solvers for all major fermionic discretizations, with multi-GPU support
— Additional performance-critical routines needed for gauge-field generation
• Maximize performance
• Exploit physical symmetries to minimize memory traffic
• Mixed-precision methods
• Autotuning for high performance on all CUDA-capable architectures
• Domain-decomposed (Schwarz) preconditioners for strong scaling
• Eigenvector and deflated solvers (Lanczos, EigCG, GMRES-DR)
• Multi-source solvers
• Multigrid solvers for optimal convergence
• A research tool for how to reach the exascale
!3
QUDA CONTRIBUTORS
Ron Babich (NVIDIA), Simone Bacchio (Cyprus), Michael Baldhauf (Regensburg), Kip Barros (LANL), Rich Brower (Boston University), Nuno Cardoso (NCSA), Kate Clark (NVIDIA), Michael Cheng (Boston University), Carleton DeTar (Utah University), Justin Foley (Utah -> NIH), Joel Giedt (Rensselaer Polytechnic Institute), Arjun Gambhir (William and Mary), Steve Gottlieb (Indiana University), Kyriakos Hadjiyiannakou (Cyprus), Dean Howarth (BU), Bálint Joó (JLab), Hyung-Jin Kim (BNL -> Samsung), Bartek Kostrzewa (Bonn), Claudio Rebbi (Boston University), Hauke Sandmeyer (Bielefeld), Guochun Shi (NCSA -> Google), Mario Schröck (INFN), Alexei Strelchenko (FNAL), Jiqun Tu (Columbia), Alejandro Vaquero (Utah University), Mathias Wagner (NVIDIA), Evan Weinberg (NVIDIA), Frank Winter (JLab)
10+ years - lots of contributors
!4
MULTI GPU BUILDING BLOCKS
Multi-GPU parallelization: the lattice is partitioned across GPUs, with face exchange and wrap-around communication between neighbouring GPUs.
Building blocks (see the sketch below):
• Halo packing kernel
• Interior kernel
• Halo communication
• Halo update kernel
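As a concrete illustration of these building blocks, a minimal sketch of the per-iteration structure follows. The kernels and the exchange_halos() helper are placeholders, not QUDA's actual code; the point is only how packing, the interior kernel, communication, and the halo update overlap on two CUDA streams.

// A minimal sketch (not QUDA's actual code) of the dependency pattern:
// pack the halos, run the interior kernel while faces are exchanged,
// then apply the halo update.
#include <cuda_runtime.h>

__global__ void pack_halo_kernel(float *send, const float *field)  { /* gather boundary sites */ }
__global__ void interior_kernel(float *out, const float *field)    { /* sites with no off-node dependencies */ }
__global__ void halo_update_kernel(float *out, const float *recv)  { /* apply the received faces */ }

// Face exchange + wrap-around; could be MPI, P2P copies, or NVSHMEM (placeholder stub).
void exchange_halos(float *send, float *recv, cudaStream_t comms) {}

void apply_dslash(float *out, const float *field, float *send, float *recv,
                  cudaStream_t compute, cudaStream_t comms) {
  dim3 grid(64), block(128);
  pack_halo_kernel<<<grid, block, 0, comms>>>(send, field);
  interior_kernel<<<grid, block, 0, compute>>>(out, field);   // overlaps with communication
  exchange_halos(send, recv, comms);
  cudaStreamSynchronize(comms);                               // faces have arrived
  halo_update_kernel<<<grid, block, 0, compute>>>(out, recv);
}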
!5
SOME TERMINOLOGY
P2P: direct exchange between GPUs
CUDA-aware MPI: MPI can take GPU pointers; can do batched, overlapping transfers (high bandwidth); see the sketch below
GPU Direct RDMA (implies CUDA-aware MPI): RDMA transfer from/to GPU memory, here NIC <-> GPU
[Figure: P2P, CUDA-aware, and GPU Direct RDMA paths on a DGX-1. Figure 4 of the NVIDIA DGX-1 With Tesla V100 System Architecture whitepaper (WP-08437-002_v01): DGX-1 uses an 8-GPU hybrid cube-mesh interconnection network topology (NVLink, PCIe, QPI); the corners of the mesh-connected faces of the cube are connected to the PCIe tree network, which also connects to the CPUs and NICs.]
[Timeline: MPI_Sendrecv over time]
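A hedged sketch of what CUDA-aware MPI means in practice: device pointers are passed straight to MPI_Sendrecv, so the application writes no explicit host staging. Buffer and rank names are illustrative only.

#include <mpi.h>

// d_send/d_recv are GPU (device) pointers: with a CUDA-aware MPI they can be
// handed to MPI directly; the library stages them internally or, with GPU
// Direct RDMA, lets the NIC read/write GPU memory without any staging.
void exchange_face(float *d_send, float *d_recv, int count,
                   int fwd_rank, int bwd_rank, MPI_Comm comm) {
  MPI_Sendrecv(d_send, count, MPI_FLOAT, fwd_rank, 0,
               d_recv, count, MPI_FLOAT, bwd_rank, 0,
               comm, MPI_STATUS_IGNORE);
}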
!7
STRONG SCALING
Multiple meanings:
• Same problem size, more nodes, more GPUs
• Same problem, next-generation GPUs
• Multigrid: strong scaling within the same run (not discussed here)
To tame strong scaling we have to understand the limiters:
• Bandwidth limiters
• Latency limiters
!8
SINGLE GPU PERFORMANCE
Wilson Dslash
Tesla V100, CUDA 10.1, GCC 7.3
[Plot: GFLOPS (0 to 3000) vs. lattice length (8 to 32) for half, single, and double precision, each with blocksize=32 and with autotuned block size; sustained bandwidths of 1180 GB/s, 1291 GB/s and 1312 GB/s in the small-volume "strong scaling" regime]
Look at scaling of Dslash on one DGX-1: half precision with 16⁴ local volume
!9
WHAT IS LIMITING STRONG SCALING?
Staging MPI transfers through host memory: classical host staging (see the sketch below)
DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning
[Timeline: packing kernel, interior kernel, D2H copies, H2D copies (t, y, z), halo kernels (t, z, y); total 297 µs]
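For contrast, here is a sketch of the classical host-staging path shown in the timeline, assuming a single face per direction; the extra D2H/H2D copies and stream synchronizations are exactly the overheads the following slides remove. This is the staging pattern in general, not QUDA's specific code.

#include <mpi.h>
#include <cuda_runtime.h>

// Classical host staging: device -> host copy, MPI on host buffers, host -> device copy.
void staged_exchange(const float *d_send, float *d_recv,
                     float *h_send, float *h_recv, int count,
                     int fwd_rank, int bwd_rank, MPI_Comm comm, cudaStream_t s) {
  cudaMemcpyAsync(h_send, d_send, count * sizeof(float), cudaMemcpyDeviceToHost, s);
  cudaStreamSynchronize(s);                              // CPU waits for the D2H copy
  MPI_Sendrecv(h_send, count, MPI_FLOAT, fwd_rank, 0,
               h_recv, count, MPI_FLOAT, bwd_rank, 0, comm, MPI_STATUS_IGNORE);
  cudaMemcpyAsync(d_recv, h_recv, count * sizeof(float), cudaMemcpyHostToDevice, s);
  cudaStreamSynchronize(s);                              // halo kernel may now be launched
}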
!10
API OVERHEADS
Staging MPI transfers through host memory: CPU overheads and synchronization are expensive
DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning
[Timeline: pack, interior, D2H copies, halo z/t/y, H2D copies, sync]
!11
USING NVLINK AND FUSING KERNELS
Compared to staging MPI transfers through host memory: fewer copies with higher bandwidth, fewer kernels, less API overhead
DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning
[Timeline: packing kernel, interior kernel, P2P copies, fused halo kernel; total 129 µs]
!12
REMOTE WRITE
Packing kernel writes directly to the remote GPU using CUDA IPC (see the sketch below), compared to staging MPI transfers through host memory
DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning
[Timeline: packing kernel, interior kernel, fused halo kernel, sync; total 89 µs]
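A hedged sketch of the CUDA IPC mechanism behind the remote write: each process exports its receive buffer once, maps the neighbour's buffer into its own address space, and the packing kernel then writes halo data directly to the peer GPU over NVLink. The handle exchange via MPI and the trivial kernel are illustrative, not QUDA's implementation.

#include <mpi.h>
#include <cuda_runtime.h>

// Packing kernel writes straight into the peer's mapped receive buffer.
__global__ void pack_to_remote(float *remote_recv, const float *field, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) remote_recv[i] = field[i];          // stores travel over NVLink to the peer GPU
}

// Map the neighbour's receive buffer into this process (done once at setup).
float *map_neighbor_buffer(float *d_my_recv, int peer, MPI_Comm comm) {
  cudaIpcMemHandle_t mine, theirs;
  cudaIpcGetMemHandle(&mine, d_my_recv);
  MPI_Sendrecv(&mine, sizeof(mine), MPI_BYTE, peer, 0,
               &theirs, sizeof(theirs), MPI_BYTE, peer, 0, comm, MPI_STATUS_IGNORE);
  void *remote = nullptr;
  cudaIpcOpenMemHandle(&remote, theirs, cudaIpcMemLazyEnablePeerAccess);
  return static_cast<float *>(remote);
}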
!14
NVSHMEM
GPU-centric communication
Implementation of OpenSHMEM, a Partitioned Global Address Space (PGAS) library
NVSHMEM features (see the sketch below):
• Symmetric memory allocations in device memory
• Communication API calls on CPU (standard and stream-ordered)
• Allows kernel-side communication (API and LD/ST) between GPUs
• NVLink and PCIe support (intra-node), InfiniBand support (inter-node)
• x86 and POWER9 support
• Interoperability with MPI and OpenSHMEM libraries
Early access (EA2) available – please reach out to nvshmem@nvidia.com
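A small host-side sketch of the features listed above: a symmetric allocation plus a stream-ordered put and barrier. It follows the public NVSHMEM API; the slide's results used a pre-release version, so the exact names of the nvshmemx_* stream-ordered variants are an assumption here.

#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>

int main() {
  nvshmem_init();
  int mype = nvshmem_my_pe();
  int npes = nvshmem_n_pes();

  // Symmetric allocation: every PE gets a buffer at the same symmetric address.
  size_t n = 1 << 20;
  float *buf = static_cast<float *>(nvshmem_malloc(n * sizeof(float)));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Stream-ordered put to the next PE, followed by a stream-ordered barrier so
  // the data is visible to kernels enqueued on the stream afterwards.
  int peer = (mype + 1) % npes;
  nvshmemx_putmem_nbi_on_stream(buf, buf, n * sizeof(float), peer, stream);
  nvshmemx_barrier_all_on_stream(stream);
  cudaStreamSynchronize(stream);

  nvshmem_free(buf);
  nvshmem_finalize();
  return 0;
}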
!15
DSLASH NVSHMEM IMPLEMENTATION
First exploration
• Keep the general structure of packing, interior and exterior Dslash
• Use nvshmem_ptr for intra-node remote writes (fine-grained); the packing buffer is located on the remote device
• Use nvshmem_putmem_nbi to write to the remote GPU over IB (one RDMA transfer)
• Need to make sure writes are visible: nvshmem_barrier_all_on_stream, or a barrier kernel that only waits for writes from neighbours
(A sketch of this packing path follows below.)
Disclaimer: results from a first implementation in QUDA with a pre-release version of NVSHMEM
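A hedged sketch of the packing path described above (not QUDA's kernel): intra-node neighbours are reached with fine-grained stores through nvshmem_ptr, inter-node neighbours with a single nvshmem_putmem_nbi RDMA transfer; visibility is then guaranteed on the host with a stream barrier or a lighter neighbour-only barrier kernel.

#include <nvshmem.h>

// recv_buf lives in NVSHMEM symmetric memory on every PE.
__global__ void pack_halo(float *recv_buf, const float *field, int face_sites, int peer) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= face_sites) return;

  float *remote = static_cast<float *>(nvshmem_ptr(recv_buf, peer));
  if (remote) {
    remote[i] = field[i];                      // peer reachable over NVLink/PCIe: direct store
  } else if (i == 0) {
    // Peer is on another node: one contiguous RDMA put over InfiniBand.
    nvshmem_putmem_nbi(recv_buf, field, face_sites * sizeof(float), peer);
  }
}
// Host side (after the launch): nvshmemx_barrier_all_on_stream(stream), or a
// barrier kernel that only waits for flags written by the neighbours.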
!16
NVSHMEM DSLASH
Compared to staging MPI transfers through host memory
DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning
[Timeline: packing kernel, interior kernel, barrier kernel, fused halo kernel; total 53 µs]
!17
NVSHMEM + FUSING KERNELS
No extra packing and barrier kernels needed, compared to staging MPI transfers through host memory
DGX-1, 16⁴ local volume, half precision, 1x2x2x2 partitioning
[Timeline: interior + pack + flag kernel, barrier + fused halo kernel; total 36 µs]
!18
LATENCY OPTIMIZATIONS
Different strategies implemented
DGX-1, 16⁴ local volume, Wilson, half precision, 1x2x2x2 partitioning
[Bar chart: GFlop/s (0 to 1,250) for IPC copies, IPC remote write, NVSHMEM with barrier, and NVSHMEM kernel fusion]
!19
DGX-2: FULL NON-BLOCKING BANDWIDTH
2.4 TB/s bisection bandwidth
[Diagram: 16 GPUs (GPU0 to GPU15) fully connected through twelve NVSwitches]
!20
DGX-2 STRONG SCALING
Global Volume 32⁴, Wilson-Dslash
[Plot: GFlop/s (0 to 20,000) vs. #GPUs (1, 2, 4, 8, 16); curves: MPI and NVSHMEM in double, single, and half precision]
!21
DGX-2 STRONG SCALING
Global Volume 32⁴, HISQ-Dslash
[Plot: GFlop/s (0 to 12,000) vs. #GPUs (1, 2, 4, 8, 16); curves: MPI and NVSHMEM in double, single, and half precision]
!22
NVSHMEM OUTLOOK
Intra-kernel synchronization and communication
One kernel to rule them all! Communication is handled in the kernel and latencies are hidden (see the sketch below).
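A hedged sketch of what such a fused kernel could look like, assuming a single neighbour and a symmetric flag variable: the same kernel sends the boundary, computes the interior, signals the peer, and waits for the incoming flag before applying the halo. Flag reset, multiple neighbours and grid-wide synchronization are omitted; this is not QUDA's implementation.

#include <nvshmem.h>

// halo and flag live in NVSHMEM symmetric memory; out/field are local device arrays.
__global__ void fused_dslash(float *out, const float *field,
                             float *halo, long *flag, int n, int peer) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // 1. One thread sends the boundary data, then a flag, to the neighbour.
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    nvshmem_putmem_nbi(halo, field, n * sizeof(float), peer);
    nvshmem_fence();                     // order the data put before the flag put
    nvshmem_long_p(flag, 1, peer);       // signal: halo data has been delivered
  }

  // 2. Interior update: no off-node dependencies, overlaps with communication.
  if (i < n) out[i] = field[i];

  // 3. Each block waits for the neighbour's flag before touching the halo.
  if (threadIdx.x == 0) nvshmem_long_wait_until(flag, NVSHMEM_CMP_EQ, 1);
  __syncthreads();
  if (i < n) out[i] += halo[i];
}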
!24
SCALING TESTBEDS
PROMETHEUS (NVIDIA): NVIDIA DGX-1, 8 V100 (16GB), 4 EDR IB, 50/100 GB/s NVLink P2P*
DGX SuperPOD (NVIDIA): NVIDIA DGX-2H, 16 Tesla V100 (32GB), 8 EDR IB, 300 GB/s NVLink P2P*, Top500 #22
SUMMIT (OLCF): IBM AC922, 6 Tesla V100 (16GB), 2 EDR IB, 100 GB/s NVLink P2P*, Top500 #1
*) bi-directional bandwidth between two GPUs
!25
MULTI-NODE STRONG SCALING
Prometheus, Wilson, 64³x128 global volume
[Plot: GFlop/s (0 to 400,000) vs. #GPUs (8 to 1024); curves: single, single GDR, single NVSHMEM]
!26
MULTI-NODE STRONG SCALING
Prometheus, Wilson, 64³x128 global volume
[Plot: GFlop/s (0 to 400,000) vs. #GPUs (8 to 1024); curves: double, single, and half precision, each with baseline MPI, GDR, and NVSHMEM]
!27
MULTI-NODE STRONG SCALING
Half precision, Wilson, 64³x128 global volume
[Plot: GFlop/s (0 to 400,000) vs. #GPUs (8 to 1024); curves: Prometheus MPI, Prometheus NVSHMEM, DGX SuperPOD MPI, DGX SuperPOD NVSHMEM, SUMMIT* MPI, SUMMIT* NVSHMEM]
*) NOTE: SUMMIT results have been obtained using a 64³x96 global volume and scaled by 8/6 (x and y axes) to make up for 6 GPUs/node. At a given number of GPUs the local volume per GPU is identical for all systems.
!28
STRONG SCALING LATTICE QCD … FURTHER WITH NVSHMEM
Optimize for latency and bandwidth: autotune kernel performance and comms policy
GPU-centric communication with NVSHMEM:
• reduce the number of kernels through fusion
• reduce API overheads and synchronizations
• simplify code (fewer lines)
Significantly improved strong scaling … but still early results (stay tuned): generalize for all Dirac operators, including Multigrid (coarse operators)
[Plot: GFlop/s (0 to 400,000) vs. #GPUs (16 to 1024); curves: Prometheus MPI, Prometheus NVSHMEM, DGX SuperPOD MPI, DGX SuperPOD NVSHMEM; half precision, Wilson, 64³x128 global volume]
!29
POSTER SESSION
RAFFLE
Tuesday, 17:50
REFERENCES
https://lattice.github.io/quda
https://developer.nvidia.com/cuda-zone
https://devblogs.nvidia.com/parallelforall/inside-volta/
Copyright © 2019 NVIDIA Corporation. All rights reserved.
LATTICE QCD ON NVIDIA® TESLA® V100
QUDA 1.0
TALKS
Tuesday, 15:20, Algorithms & Machines: Leadership-Class Multi-Grid Algorithms for HISQ Fermions on GPUs.
Tuesday, 15:40, Algorithms & Machines: Breaking the latency barrier: Strong scaling LQCD on GPUs.
Thursday, 11:15, Plenary: GPUs for Lattice Field Theory.
QUDA: A LIBRARY FOR QCD AND BEYOND ON GPUs
QUDA is an open source, community-developed and NVIDIA-supported library for performing LQCD and strongly coupled BSM calculations on GPUs, leveraging NVIDIA's CUDA platform. QUDA provides highly optimized mixed-precision linear solvers, eigenvector solvers, multi-grid algorithms, gauge-link fattening and fermion force algorithms.
Supported fermion types are: Wilson, Wilson-clover, twisted mass (singlet and doublet), twisted mass clover, naïve staggered, improved staggered (ASQTAD or HISQ), domain wall (4-d or 5-d) and Möbius.
Use of multiple GPUs in parallel is supported throughout the library, with inter-GPU communication achieved using MPI or QMP. Several commonly used LQCD applications integrate support for QUDA as a compile-time option, including Chroma, MILC, CPS, BQCD, TIFR and tmLQCD.
QUDA is an unparalleled research tool for reaching the exascale. With its new high-level C++11 framework, QUDA enables testing new fermion discretizations and new algorithms performantly and at scale with relative ease.
WHY GPUs?
• LQCD simulations are typically memory-bandwidth bound, and so run very efficiently on GPUs.
• LQCD simulations have high degrees of data parallelism that can be expressed effectively using the single instruction multiple data (SIMD) paradigm. This makes LQCD ideal for GPU deployment.
• Most LQCD calculations require only local communication on the 4-d lattice. This makes them suitable for deployment on multiple GPUs by partitioning the lattice into disjoint equal sub-lattices and distributing these between GPUs.
• GPUs are prevalent on the fastest computing clusters and supercomputers in the world.
QUDA REWRITE
Old Dslash kernel code became increasingly limiting. The rewrite brings a lot of benefits:
- Extensibility, composability and maintainability: “one Wilson kernel to rule them all”
- Ability to add new discretizations easily: e.g. twisted-clover doublet
- Changing representation, Nc, etc.: accessor abstraction
- Larger local volumes
- Ability to run on CPU: future framework for all architectures?
- Same or better performance
ROAD TO THE FUTURE
Block solvers for all fermion actions
Eigenvector compression
Continuous optimization of multi-grid algorithms
Improved scaling: NVSHMEM
C++ interface
Alternative compilation targets (e.g., C++17 pSTL)
ANNOUNCING QUDA 1.0
Complete re-write of all operators in a new Dslash framework
Jitify support: huge reduction in compile time, enabling even more rapid algorithm development
Improved multi-GPU performance taking advantage of the latest GPU Direct RDMA and NVLink technologies
Adaptive multigrid for Wilson, Wilson-clover, twisted-mass, twisted-clover, staggered, and improved staggered fermions
A polynomial-accelerated Lanczos as well as deflated CG, communication-avoiding (CA) CG, and SVD-deflated (CA-)GCR
Significantly improved build system and automated unit tests
Various other new routines and algorithms, code cleanup and bug fixes
TESLA VOLTA V100
21B transistors, 815 mm²
80 SMs, 5120 CUDA cores, 640 Tensor cores
16/32 GB HBM2, 900 GB/s HBM2 bandwidth, 300 GB/s NVLink
MULTI-GPU WILSON DSLASH
Global Volume = 32⁴, DGX-2, NVLink + NVSwitch
[Plot: GFLOPS (0 to 18,000) vs. GPUs (1, 2, 4, 8, 16); curves: double, single, and half precision, MPI vs. NVSHMEM]
Authors: Kate Clark, Mathias Wagner, Evan Weinberg, NVIDIA
NVIDIA DGX-2
• Two GPU boards, 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card
• Twelve NVSwitches, 2.4 TB/sec bi-section bandwidth
• Eight EDR InfiniBand/100 GigE, 1600 Gb/sec total bi-directional bandwidth
• PCIe switch complex
• Two Intel Xeon Platinum CPUs
• 1.5 TB system memory
• Dual 10/25 Gb/sec Ethernet
• 30 TB NVMe SSD internal storage
• NVIDIA Tesla V100 32GB
QUDA NODE PERFORMANCE OVER TIME
Time to solution is measured for solving the Wilson operator against a random source on a 24³x64 lattice, β=5.5, Mπ = 416 MeV. One node is defined to be 3 GPUs.
[Plot: Wilson FP32 GFLOPS and speedup vs. year (2008 to 2019), with milestones marked: multi-GPU capable, adaptive MG, optimized MG, deflated MG]
NVSHMEM
Implementation of the OpenSHMEM standard for GPUs; GPU-centric communication
• Communication controlled in the kernels
• Fine-grained load/store over NVLink
• Bulk transfers over InfiniBand using RDMA
• Reduce the number of required kernels by enabling kernel fusion
• Remove API overheads and GPU-CPU synchronizations
DEFLATED ADAPTIVE MG AT THE PHYSICAL POINT
In collaboration with Dean Howarth (BU)
Multigrid shifts the lowest eigenvalues to the coarse grids. Some Dirac operators (staggered and twisted mass) end up with a pathological coarse-grid spectrum. For the twisted-clover operator, the solution has been to add a fictitious heavy twist to the coarse operator to improve its condition number, at the cost of decreased multigrid efficiency [Alexandrou et al]. Instead we deflate the coarse-grid operator, recovering optimal MG convergence and a 3x speedup over "mu scaling".
CHROMA HMC-MG ON SUMMIT
In collaboration with Bálint Joó (Chroma), Boram Yoon (force gradient integrator), Frank Winter (QDP-JIT)
From Titan running 2016 code to Summit running 2019 code we see a >82x speedup in HMC throughput. The multiplicative speedup comes from machine and algorithm innovation, with highly optimized multigrid for gauge field evolution.
[Bar chart: wallclock time (seconds) for Titan (1024x and 512x K20X) and Summit (128x V100, Mar 2019 and Nov 2019 code); values span 4006 s down to 392 s. Annotations: 4.1x faster on 2x fewer GPUs (~8x gain); 10.2x faster on 8x fewer GPUs (~82x gain)]
MULTI-GPU HISQ DSLASH
Global Volume = 32⁴, DGX-2, NVLink + NVSwitch
[Plot: GFLOPS (0 to 12,000) vs. GPUs (1, 2, 4, 8, 16); curves: double, single, and half precision, MPI vs. NVSHMEM]
We are pleased to announce QUDA 1.0
SUMMIT AND SIERRA ARE NOW LIVE
IBM POWER9 + NVIDIA Volta V100 + NVLink
Summit: 4608 nodes (6 GPUs/node), 200 PFLOPS peak. The most powerful computer in the world.
Sierra: 4320 nodes (4 GPUs/node), 125 PFLOPS peak. The 2nd most powerful computer in the world.
MULTI-NODE WILSON DSLASH
Global Volume = 64³x128, DGX-1 with 8 V100, NVLink, 4 EDR IB
[Plot: GFLOPS (0 to 250,000) vs. GPUs (8 to 512); curves: MPI with P2P off / GDR off, MPI with P2P on / GDR on, and NVSHMEM, each in half, single, and double precision]
Time to solution is measured running QUDA on 4 DGX-1 (32 V100) for solving the twisted-mass + clover operator against a random source on a 64³x128 lattice, β = 1.778, κ = 0.139427, µ = 0.000720, solver tolerance 10⁻⁷.
[Bar chart: time (s) and speedup for CG double, CG double-single, CG double-half, baseline "mu scaling" MG, and CA-GCR with deflation of 512 and 1024 vectors]
HISQ MULTIGRID
48³x96, β = 6.4 (quenched), HISQ fermions
[Plot: time to solution (s) vs. quark mass (0.001 to 0.032) for CG and 4-level MG with 64 deflated eigenvectors]
Time to solution is the measured time for solving the HISQ operator against a random source. Runs were performed on 3 DGX-1 (24 V100 total).
!31
MULTI-NODE STRONG SCALING
Double precision, Wilson, 64³x128 global volume
[Plot: GFlop/s (0 to 150,000) vs. #GPUs (8 to 1024); curves: Prometheus MPI, Prometheus NVSHMEM, DGX SuperPOD MPI, DGX SuperPOD NVSHMEM, SUMMIT* MPI, SUMMIT* NVSHMEM]
*) NOTE: SUMMIT results have been obtained using a 64³x96 global volume and scaled by 8/6 (x and y axes) to make up for 6 GPUs/node. At a given number of GPUs the local volume per GPU is identical for all systems.
!32
MULTI-NODE STRONG SCALING
Single precision, Wilson, 64³x128 global volume
[Plot: GFlop/s (0 to 250,000) vs. #GPUs (8 to 1024); curves: Prometheus MPI, Prometheus NVSHMEM, DGX SuperPOD MPI, DGX SuperPOD NVSHMEM, SUMMIT* MPI, SUMMIT* NVSHMEM]
*) NOTE: SUMMIT results have been obtained using a 64³x96 global volume and scaled by 8/6 (x and y axes) to make up for 6 GPUs/node. At a given number of GPUs the local volume per GPU is identical for all systems.