TRANSCRIPT
Instant AI HPC Infrastructure
DGX SUPERPOD
NVIDIA DGX SUPERPOD
Highest-Performance AI Research Supercomputer
— #20 on Top500 list | Top AI Performance Records
Instant AI Infrastructure
— Modular and scalable architecture
— Integrated and optimized compute, networking, storage, and software
NVIDIA-Optimized Software Stacks
— Freely available on NGC
Instant AI Compute Infrastructure
OUTLINE
The Case for “One Gigantic GPU” and NVLink Review
NVSwitch — Speeds and Feeds, Architecture and System Applications
DGX-2 Server — Speeds and Feeds, Architecture, and Packaging
NVIDIA SuperPOD
QUDA Performance on SuperPOD
MULTI-CORE AND CUDA WITH ONE GPU
Users explicitly express parallel work in NVIDIA CUDA®
GPU Driver distributes work to available Graphics Processing Clusters (GPC)/Streaming Multiprocessor (SM) cores
GPC/SM cores can compute on data in any of the second-generation High Bandwidth Memories (HBM2s)
GPC/SM cores use shared HBM2s to exchange data
[Diagram: a single GPU. GPCs and copy engines connect through the on-chip XBAR to the L2 caches and HBM2 stacks; a hub block carries the NVLinks and PCIe I/O. Work (data and CUDA kernels) arrives from the CPU over PCIe, and results (data) return the same way.]
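As a minimal sketch of how this looks in practice (illustrative, not from the slides; the kernel name scale is made up, the runtime calls are standard CUDA): a program expresses its parallelism explicitly by launching a kernel over a grid of threads, and the driver and hardware schedule the resulting thread blocks onto whatever GPC/SM resources are free, with all data resident in HBM2.

#include <cuda_runtime.h>

// Each thread handles one element; the hardware scheduler assigns
// thread blocks to available SMs across the GPU's GPCs.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));            // lives in the GPU's HBM2
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);  // explicitly parallel work
    cudaDeviceSynchronize();                      // results ready to copy back
    cudaFree(x);
    return 0;
}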
TWO GPUS WITH PCIE
Access to the other GPU's HBM2 is at PCIe bandwidth (32 GBps bidirectional)
Interactions with the CPU compete with GPU-to-GPU traffic
PCIe is the "Wild West" (lots of performance bandits)
[Diagram: GPU0 and GPU1, each with GPCs, XBAR, L2 caches + HBM2, hub, NVLinks, and copy engines. The two GPUs communicate only through their PCIe I/O ports, sharing the same PCIe path the CPU uses to send work (data and CUDA kernels) and receive results (data).]
TWO GPUS WITH NVLINK
All GPCs can access all HBM2 memories
Access to the other GPU's HBM2 is at multi-NVLink bandwidth (300 GBps bidirectional on V100 GPUs)
NVLinks are effectively a "bridge" between XBARs
No collisions with PCIe traffic
[Diagram: GPU0 and GPU1 as before, but now the NVLinks of each GPU bridge the two XBARs directly, so any GPC can load/store any HBM2 stack; PCIe I/O to the CPU carries only work (data and CUDA kernels) and results (data).]
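A hedged sketch (not from the slides) of what this means for CUDA code: once peer access is enabled between two NVLink-connected GPUs, a kernel on GPU 0 can simply dereference a pointer that was allocated on GPU 1, and those loads travel over NVLink rather than PCIe. The kernel name read_peer is illustrative; the runtime calls are standard CUDA.

#include <cuda_runtime.h>

// Reads data resident in the peer GPU's HBM2; on NVLink-connected GPUs
// these loads cross the NVLink "bridge" between the two XBARs.
__global__ void read_peer(const float *peer_data, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = peer_data[i];
}

int main() {
    const int n = 1 << 20;
    float *buf1, *out0;

    cudaSetDevice(1);                      // allocate a buffer on GPU 1
    cudaMalloc(&buf1, n * sizeof(float));

    cudaSetDevice(0);                      // GPU 0 will read it directly
    cudaMalloc(&out0, n * sizeof(float));

    int can = 0;
    cudaDeviceCanAccessPeer(&can, 0, 1);   // can GPU 0 map GPU 1's memory?
    if (can) {
        cudaDeviceEnablePeerAccess(1, 0);  // enable access to peer device 1
        read_peer<<<(n + 255) / 256, 256>>>(buf1, out0, n);
        cudaDeviceSynchronize();
    }
    return 0;
}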
"ONE GIGANTIC GPU" BENEFITS
Problem Size Capacity
Problem size is limited by the aggregate HBM2 capacity of the entire set of GPUs, rather than the capacity of a single GPU
Strong Scaling
NUMA effects greatly reduced compared to existing solutions
Aggregate bandwidth to HBM2 grows with the number of GPUs
Ease of Use
Apps written for a small number of GPUs port more easily
Abundant resources enable rapid experimentation
[Diagram: 16 GPUs (GPU0–GPU15) and two CPUs all attached to a single NVLink XBAR, posing the question of how such a fabric could be built.]
DGX-2 NVLINK FABRIC
Two of these building blocks together form a fully connected 16-GPU cluster
Non-blocking, non-interfering (unless the same destination is involved)
Regular loads, stores, and atomics just work
Left-right symmetry simplifies physical packaging and manufacturability
[Diagram: two banks of eight V100 GPUs joined through NVSwitch chips into one fully connected 16-GPU NVLink fabric.]
The World's Most Powerful AI Computer
NVIDIA DGX-2
2 PFLOPS | 512 GB HBM2 | 10 kW | 350 lbs
[Annotated photo: 16x Tesla V100 32 GB, 12x NVSwitch, NVLink plane card, 8x EDR IB/100 GigE, 2x Xeon Platinum, 1.5 TB system memory, PCIe switch complex, 30 TB NVMe SSDs]
Parameter              DGX-2 Spec
CPUs                   Dual Xeon Platinum 8168
CPU Memory             1.5 TB DDR4
Aggregate Storage      30 TB (8 NVMe drives)
Peak Max TDP           10 kW
Dimensions (H/W/D)     17.3" (10U) / 19.0" / 32.8" (440.0 mm / 482.3 mm / 834.0 mm)
Weight                 340 lbs (154.2 kg)
Cooling (forced air)   1,000 CFM
NVIDIA DGX-2: SPEEDS AND FEEDS
Parameter                     DGX-2 Spec
Number of Tesla V100 GPUs     16
Aggregate FP64/FP32           125/250 TFLOPS
Aggregate Tensor (FP16)       2,000 TFLOPS
Aggregate Shared HBM2         512 GB
Aggregate HBM2 Bandwidth      14.4 TBps
Per-GPU NVLink Bandwidth      300 GBps bidirectional
Chassis Bisection Bandwidth   2.4 TBps
InfiniBand NICs               8x Mellanox EDR
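As a quick sanity check (my arithmetic, not on the slide), these aggregates are 16 copies of the per-GPU V100 figures: 16 x 7.8 TFLOPS FP64 ≈ 125 TFLOPS and 16 x 15.7 TFLOPS FP32 ≈ 250 TFLOPS; 16 x 125 Tensor TFLOPS = 2,000 TFLOPS; 16 x 32 GB = 512 GB of HBM2; 16 x 900 GB/s = 14.4 TB/s of HBM2 bandwidth. The 2.4 TBps bisection figure is consistent with the 8 GPUs on either half of the fabric each having 300 GBps of NVLink bandwidth (8 x 300 GB/s = 2.4 TB/s).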
DGX-2 PCIE NETWORK
Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI
The PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect RDMA over the IB network
Configuration and control of the NVSwitches is via a driver process running on the CPUs
[Diagram: DGX-2 PCIe topology. Two x86 CPUs (QPI-linked), each at the root of a PCIe switch tree; the PCIe switches fan out to pairs of V100 GPUs, with a 100G NIC sharing a PCIe switch with each GPU pair for GPUDirect RDMA. The NVSwitch fabric connects the 16 GPUs separately from the PCIe/QPI network.]
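A minimal sketch of what this topology enables on the application side (assuming a CUDA-aware MPI built with GPUDirect RDMA support; not from the slides): each rank is pinned by the launcher to the GPU and NIC local to its CPU socket, and device pointers are handed straight to MPI so the NIC moves data to and from HBM2 without staging through host memory. The ring exchange below is purely illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // One rank per GPU; the launcher is assumed to bind each rank to the
    // CPU socket and NIC nearest its GPU, keeping traffic off QPI.
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(rank % ngpus);

    const int n = 1 << 20;
    float *sendbuf, *recvbuf;
    cudaMalloc(&sendbuf, n * sizeof(float));
    cudaMalloc(&recvbuf, n * sizeof(float));

    // A CUDA-aware MPI accepts device pointers; with GPUDirect RDMA the
    // EDR NIC reads/writes GPU HBM2 directly, no host bounce buffer.
    int right = (rank + 1) % nranks;
    int left  = (rank + nranks - 1) % nranks;
    MPI_Sendrecv(sendbuf, n, MPI_FLOAT, right, 0,
                 recvbuf, n, MPI_FLOAT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}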
NVIDIA DGX SUPERPOD
Instant AI Compute Infrastructure
AI Compute
96 NVIDIA DGX-2 Nodes
1,536 V100 GPUs
192 PF Peak
49 TB HBM2 Memory
Networking
1 Terabit Data Bandwidth per Node
10 Mellanox EDR InfiniBand per Node
Fully Connected EDR InfiniBand Switch
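For reference (my arithmetic, not on the slide), the aggregates follow directly from the node count: 96 nodes x 16 GPUs = 1,536 V100 GPUs; 96 x 2 PFLOPS ≈ 192 PF of peak Tensor performance; 96 x 512 GB ≈ 49 TB of HBM2; and 10 EDR links x 100 Gb/s = 1 Tb/s of data bandwidth per node.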
NVIDIA DGX SUPERPOD
Mellanox EDR 100G InfiniBand Network
Mellanox Smart Director Switches
In-Network Computing Acceleration Engines
Fast and Efficient Storage Access with RDMA
Up to 130 Tb/s Switching Capacity per Switch
Ultra-Low Latency of 300 ns
Integrated Network Manager
Terabit-Speed InfiniBand Networking per Node
[Diagram: Racks 1–16 holding 64 DGX-2 systems, connected to a compute backplane switch and a storage backplane switch (GPFS); per-node links of 800 Gb/s and 200 Gb/s.]
DGX-2 STRONG SCALING
Wilson-Dslash, 32^4 global volume
[Plot: GFlop/s vs. number of GPUs (1–16) for MPI and SHMEM implementations in double, single, and half precision.]
MULTI-NODE STRONG SCALING
DGX SuperPOD, Wilson, 64^3x128 global volume
[Plot: GFlop/s vs. number of GPUs (16–1024) for GDR and NVSHMEM implementations in double, single, and half precision.]
FERMION SOLVERS
Combination of algorithm (multigrid) and machine (GPUs)
A single Volta can run at one second per Wilson solve with a local volume of V = 32^3x64 per GPU
A single node (DGX-2) can solve V = 64^3x128 at one second per solve
16 nodes of DGX-2 can solve V = 128^3x256 at one second per solve
Fermion solvers are not the challenge they used to be (caveats unbound)
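These three statements line up exactly with the GPU counts (my arithmetic, not on the slide): 64^3x128 = (2·32)^3 x (2·64) = 16 x (32^3x64), the 16 GPUs of one DGX-2; and 128^3x256 = (2·64)^3 x (2·128) = 16 x (64^3x128), the 256 GPUs of 16 DGX-2 nodes.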
Providing computing power for the da Vincis and Einsteins of our era
We set out 26 years ago to transform computer graphics. Fueled by the massive growth of the gaming market and its insatiable demand for better 3D graphics, we've evolved the GPU into a computer brain at the intersection of virtual reality, high performance computing, and artificial intelligence. NVIDIA GPU computing has become the essential tool of the da Vincis and Einsteins of our time. For them, we've built the equivalent of a time machine.
NVSHMEM: GPU-Centric Communication
Implementation of OpenSHMEM, a Partitioned Global Address Space (PGAS) library
NVSHMEM features:
Allows kernel-side communication (API and LD/ST) between GPUs
NVLink and PCIe support (intranode), InfiniBand support (internode)
x86 and POWER9 support
Interoperability with MPI and OpenSHMEM libraries
NVSHMEM has been developed as an NVIDIA-internal co-design with QUDA
Early access (EA2) available – please reach out to [email protected]
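A hedged sketch of what kernel-side communication looks like (illustrative only, using the public NVSHMEM/OpenSHMEM-style API; the device selection and the exchange kernel are simplified): data is allocated from the symmetric heap, and a thread inside a kernel issues a one-sided put to another PE without returning to the host.

#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its rank into the next PE's symmetric buffer directly
// from inside the kernel (NVLink/PCIe intranode, InfiniBand internode).
__global__ void exchange(int *dest, int mype, int npes) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int peer = (mype + 1) % npes;
        nvshmem_int_p(dest, mype, peer);   // device-side one-sided put
    }
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    cudaSetDevice(mype % 16);                        // simplified: one PE per GPU

    int *dest = (int *)nvshmem_malloc(sizeof(int));  // symmetric allocation
    exchange<<<1, 32>>>(dest, mype, npes);
    nvshmemx_barrier_all_on_stream(0);               // complete puts, sync PEs
    cudaDeviceSynchronize();

    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}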
QUDA NODE PERFORMANCE OVER TIME
Multiplicative speedup through software and hardware
[Plot, 2008–2019: per-node Wilson FP32 GFLOPS and cumulative speedup, with milestones "Multi-GPU capable", "Adaptive Multigrid", "Optimized Multigrid", and "Deflated Multigrid"; roughly a 300x overall speedup.]
Speedup determined by measured time to solution for solving the Wilson operator against a random source on a V = 24^3x64 lattice, β = 5.5, Mπ = 416 MeV. One node is defined to be 3 GPUs.