TRANSCRIPT
Instant AI HPC Infrastructure
DGX SUPERPOD
NVIDIA DGX SUPERPOD
Highest-Performance AI Research Supercomputer
— #20 on Top500 list | Top AI Performance Records
Instant AI Infrastructure
— Modular and scalable architecture
— Integrated and optimized compute, networking, storage, and software
NVIDIA-Optimized Software Stacks
— Freely available on NGC
Instant AI Compute Infrastructure
OUTLINE
The Case for “One Gigantic GPU” and NVLink Review
NVSwitch — Speeds and Feeds, Architecture and System Applications
DGX-2 Server — Speeds and Feeds, Architecture, and Packaging
NVIDIA SuperPOD
QUDA Performance on SuperPOD
MULTI-CORE AND CUDA WITH ONE GPU
Users explicitly express parallel work in NVIDIA CUDA®
GPU Driver distributes work to available Graphics Processing Clusters (GPC)/Streaming Multiprocessor (SM) cores
GPC/SM cores can compute on data in any of the second-generation High Bandwidth Memories (HBM2s)
GPC/SM cores use shared HBM2s to exchange data
[Diagram: a single GPU. GPCs and copy engines connect through the on-chip XBAR to the L2 caches and HBM2 stacks; a hub block carries the NVLinks and PCIe I/O. Work (data and CUDA kernels) arrives from the CPU over PCIe, and results (data) return the same way.]
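As a minimal sketch of how this looks in practice (illustrative, not from the slides; the kernel name scale is made up, the runtime calls are standard CUDA): a program expresses its parallelism explicitly by launching a kernel over a grid of threads, and the driver and hardware schedule the resulting thread blocks onto whatever GPC/SM resources are free, with all data resident in HBM2.

#include <cuda_runtime.h>

// Each thread handles one element; the hardware scheduler assigns
// thread blocks to available SMs across the GPU's GPCs.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));            // lives in the GPU's HBM2
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);  // explicitly parallel work
    cudaDeviceSynchronize();                      // results ready to copy back
    cudaFree(x);
    return 0;
}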
TWO GPUS WITH PCIE
Access to the other GPU's HBM2 is at PCIe bandwidth (32 GBps bidirectional)
Interactions with the CPU compete with GPU-to-GPU traffic
PCIe is the "Wild West" (lots of performance bandits)
[Diagram: GPU0 and GPU1, each with GPCs, XBAR, L2 caches + HBM2, hub, NVLinks, and copy engines. The two GPUs communicate only through their PCIe I/O ports, sharing the same PCIe path the CPU uses to send work (data and CUDA kernels) and receive results (data).]
TWO GPUS WITH NVLINK
All GPCs can access all HBM2 memories
Access to the other GPU's HBM2 is at multi-NVLink bandwidth (300 GBps bidirectional on V100 GPUs)
NVLinks are effectively a "bridge" between XBARs
No collisions with PCIe traffic
[Diagram: GPU0 and GPU1 as before, but now the NVLinks of each GPU bridge the two XBARs directly, so any GPC can load/store any HBM2 stack; PCIe I/O to the CPU carries only work (data and CUDA kernels) and results (data).]
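A hedged sketch (not from the slides) of what this means for CUDA code: once peer access is enabled between two NVLink-connected GPUs, a kernel on GPU 0 can simply dereference a pointer that was allocated on GPU 1, and those loads travel over NVLink rather than PCIe. The kernel name read_peer is illustrative; the runtime calls are standard CUDA.

#include <cuda_runtime.h>

// Reads data resident in the peer GPU's HBM2; on NVLink-connected GPUs
// these loads cross the NVLink "bridge" between the two XBARs.
__global__ void read_peer(const float *peer_data, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = peer_data[i];
}

int main() {
    const int n = 1 << 20;
    float *buf1, *out0;

    cudaSetDevice(1);                      // allocate a buffer on GPU 1
    cudaMalloc(&buf1, n * sizeof(float));

    cudaSetDevice(0);                      // GPU 0 will read it directly
    cudaMalloc(&out0, n * sizeof(float));

    int can = 0;
    cudaDeviceCanAccessPeer(&can, 0, 1);   // can GPU 0 map GPU 1's memory?
    if (can) {
        cudaDeviceEnablePeerAccess(1, 0);  // enable access to peer device 1
        read_peer<<<(n + 255) / 256, 256>>>(buf1, out0, n);
        cudaDeviceSynchronize();
    }
    return 0;
}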
"ONE GIGANTIC GPU" BENEFITS
Problem Size Capacity
Problem size is limited by the aggregate HBM2 capacity of the entire set of GPUs, rather than the capacity of a single GPU
Strong Scaling
NUMA effects greatly reduced compared to existing solutions
Aggregate bandwidth to HBM2 grows with the number of GPUs
Ease of Use
Apps written for a small number of GPUs port more easily
Abundant resources enable rapid experimentation
[Diagram: 16 GPUs (GPU0–GPU15) and two CPUs all attached to a single NVLink XBAR, posing the question of how such a fabric could be built.]
DGX-2 NVLINK FABRIC
Two of these building blocks together form a fully connected 16-GPU cluster
Non-blocking, non-interfering (unless the same destination is involved)
Regular loads, stores, and atomics just work
Left-right symmetry simplifies physical packaging and manufacturability
[Diagram: two banks of eight V100 GPUs joined through NVSwitch chips into one fully connected 16-GPU NVLink fabric.]
The World's Most Powerful AI Computer
NVIDIA DGX-2
2 PFLOPS | 512 GB HBM2 | 10 kW | 350 lbs
[Annotated photo: 16x Tesla V100 32 GB, 12x NVSwitch, NVLink plane card, 8x EDR IB/100 GigE, 2x Xeon Platinum, 1.5 TB system memory, PCIe switch complex, 30 TB NVMe SSDs]
Parameter              DGX-2 Spec
CPUs                   Dual Xeon Platinum 8168
CPU Memory             1.5 TB DDR4
Aggregate Storage      30 TB (8 NVMe drives)
Peak Max TDP           10 kW
Dimensions (H/W/D)     17.3" (10U) / 19.0" / 32.8" (440.0 mm / 482.3 mm / 834.0 mm)
Weight                 340 lbs (154.2 kg)
Cooling (forced air)   1,000 CFM
NVIDIA DGX-2: SPEEDS AND FEEDS
Parameter                     DGX-2 Spec
Number of Tesla V100 GPUs     16
Aggregate FP64/FP32           125/250 TFLOPS
Aggregate Tensor (FP16)       2,000 TFLOPS
Aggregate Shared HBM2         512 GB
Aggregate HBM2 Bandwidth      14.4 TBps
Per-GPU NVLink Bandwidth      300 GBps bidirectional
Chassis Bisection Bandwidth   2.4 TBps
InfiniBand NICs               8x Mellanox EDR
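As a quick sanity check (my arithmetic, not on the slide), these aggregates are 16 copies of the per-GPU V100 figures: 16 x 7.8 TFLOPS FP64 ≈ 125 TFLOPS and 16 x 15.7 TFLOPS FP32 ≈ 250 TFLOPS; 16 x 125 Tensor TFLOPS = 2,000 TFLOPS; 16 x 32 GB = 512 GB of HBM2; 16 x 900 GB/s = 14.4 TB/s of HBM2 bandwidth. The 2.4 TBps bisection figure is consistent with the 8 GPUs on either half of the fabric each having 300 GBps of NVLink bandwidth (8 x 300 GB/s = 2.4 TB/s).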
DGX-2 PCIE NETWORK
Xeon sockets are QPI-connected, but affinity binding keeps GPU-related traffic off QPI
The PCIe tree has NICs connected to pairs of GPUs to facilitate GPUDirect RDMA over the IB network
Configuration and control of the NVSwitches is via a driver process running on the CPUs
[Diagram: DGX-2 PCIe topology. Two x86 CPUs (QPI-linked), each at the root of a PCIe switch tree; the PCIe switches fan out to pairs of V100 GPUs, with a 100G NIC sharing a PCIe switch with each GPU pair for GPUDirect RDMA. The NVSwitch fabric connects the 16 GPUs separately from the PCIe/QPI network.]
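A minimal sketch of what this topology enables on the application side (assuming a CUDA-aware MPI built with GPUDirect RDMA support; not from the slides): each rank is pinned by the launcher to the GPU and NIC local to its CPU socket, and device pointers are handed straight to MPI so the NIC moves data to and from HBM2 without staging through host memory. The ring exchange below is purely illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // One rank per GPU; the launcher is assumed to bind each rank to the
    // CPU socket and NIC nearest its GPU, keeping traffic off QPI.
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(rank % ngpus);

    const int n = 1 << 20;
    float *sendbuf, *recvbuf;
    cudaMalloc(&sendbuf, n * sizeof(float));
    cudaMalloc(&recvbuf, n * sizeof(float));

    // A CUDA-aware MPI accepts device pointers; with GPUDirect RDMA the
    // EDR NIC reads/writes GPU HBM2 directly, no host bounce buffer.
    int right = (rank + 1) % nranks;
    int left  = (rank + nranks - 1) % nranks;
    MPI_Sendrecv(sendbuf, n, MPI_FLOAT, right, 0,
                 recvbuf, n, MPI_FLOAT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}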
NVIDIA DGX SUPERPOD
Instant AI Compute Infrastructure
AI Compute
96 NVIDIA DGX-2 Nodes
1,536 V100 GPUs
192 PF Peak
49 TB HBM2 Memory
Networking
1 Terabit Data Bandwidth per Node
10 Mellanox EDR InfiniBand per Node
Fully Connected EDR InfiniBand Switch
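For reference (my arithmetic, not on the slide), the aggregates follow directly from the node count: 96 nodes x 16 GPUs = 1,536 V100 GPUs; 96 x 2 PFLOPS ≈ 192 PF of peak Tensor performance; 96 x 512 GB ≈ 49 TB of HBM2; and 10 EDR links x 100 Gb/s = 1 Tb/s of data bandwidth per node.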
NVIDIA DGX SUPERPOD
Mellanox EDR 100G InfiniBand Network
Mellanox Smart Director Switches
In-Network Computing Acceleration Engines
Fast and Efficient Storage Access with RDMA
Up to 130 Tb/s Switching Capacity per Switch
Ultra-Low Latency of 300 ns
Integrated Network Manager
Terabit-Speed InfiniBand Networking per Node
[Diagram: Racks 1–16 holding 64 DGX-2 systems, connected to a compute backplane switch and a storage backplane switch (GPFS); per-node links of 800 Gb/s and 200 Gb/s.]
DGX-2 STRONG SCALING
Wilson-Dslash, 32^4 global volume
[Plot: GFlop/s vs. number of GPUs (1–16) for MPI and SHMEM implementations in double, single, and half precision.]
MULTI-NODE STRONG SCALING
DGX SuperPOD, Wilson, 64^3x128 global volume
[Plot: GFlop/s vs. number of GPUs (16–1024) for GDR and NVSHMEM implementations in double, single, and half precision.]
FERMION SOLVERS
Combination of algorithm (multigrid) and machine (GPUs)
A single Volta can run at one second per Wilson solve with a local volume of V = 32^3x64 per GPU
A single node (DGX-2) can solve V = 64^3x128 at one second per solve
16 nodes of DGX-2 can solve V = 128^3x256 at one second per solve
Fermion solvers are not the challenge they used to be (caveats unbound)
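These three statements line up exactly with the GPU counts (my arithmetic, not on the slide): 64^3x128 = (2·32)^3 x (2·64) = 16 x (32^3x64), the 16 GPUs of one DGX-2; and 128^3x256 = (2·64)^3 x (2·128) = 16 x (64^3x128), the 256 GPUs of 16 DGX-2 nodes.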
Providing computing power for the da Vincis and Einsteins of our era
We set out 26 years ago to transform computer graphics. Fueled by the massive growth of the gaming market and its insatiable demand for better 3D graphics, we've evolved the GPU into a computer brain at the intersection of virtual reality, high performance computing, and artificial intelligence. NVIDIA GPU computing has become the essential tool of the da Vincis and Einsteins of our time. For them, we've built the equivalent of a time machine.
NVSHMEM: GPU-Centric Communication
Implementation of OpenSHMEM, a Partitioned Global Address Space (PGAS) library
NVSHMEM features:
Allows kernel-side communication (API and LD/ST) between GPUs
NVLink and PCIe support (intranode), InfiniBand support (internode)
x86 and POWER9 support
Interoperability with MPI and OpenSHMEM libraries
NVSHMEM has been developed as an NVIDIA-internal co-design with QUDA
Early access (EA2) available – please reach out to [email protected]
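A hedged sketch of what kernel-side communication looks like (illustrative only, using the public NVSHMEM/OpenSHMEM-style API; the device selection and the exchange kernel are simplified): data is allocated from the symmetric heap, and a thread inside a kernel issues a one-sided put to another PE without returning to the host.

#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE writes its rank into the next PE's symmetric buffer directly
// from inside the kernel (NVLink/PCIe intranode, InfiniBand internode).
__global__ void exchange(int *dest, int mype, int npes) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int peer = (mype + 1) % npes;
        nvshmem_int_p(dest, mype, peer);   // device-side one-sided put
    }
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    cudaSetDevice(mype % 16);                        // simplified: one PE per GPU

    int *dest = (int *)nvshmem_malloc(sizeof(int));  // symmetric allocation
    exchange<<<1, 32>>>(dest, mype, npes);
    nvshmemx_barrier_all_on_stream(0);               // complete puts, sync PEs
    cudaDeviceSynchronize();

    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}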
QUDA NODE PERFORMANCE OVER TIME
Multiplicative speedup through software and hardware
[Plot, 2008–2019: per-node Wilson FP32 GFLOPS and cumulative speedup, with milestones "Multi-GPU capable", "Adaptive Multigrid", "Optimized Multigrid", and "Deflated Multigrid"; roughly a 300x overall speedup.]
Speedup determined by measured time to solution for solving the Wilson operator against a random source on a V = 24^3x64 lattice, β = 5.5, Mπ = 416 MeV. One node is defined to be 3 GPUs.