Tesla Master Deck - HPC Saudi 2019
Jan 2018
TESLA PLATFORM
A NEW ERA OF COMPUTING
1995: PC INTERNET · WinTel, Yahoo! · 1 billion PC users
2005: MOBILE-CLOUD · iPhone, Amazon AWS · 2.5 billion mobile users
2015: AI & IOT · Deep Learning, GPU · 100s of billions of devices
Artificial Intelligence · Computer Graphics · GPU Computing
NVIDIA: "THE AI COMPUTING COMPANY"
RISE OF GPU COMPUTING

[Chart: compute performance, 1980 to 2020, log scale from 10^2 to 10^7]
GPU-computing performance: 1.5X per year, on track for 1000X by 2025
Single-threaded performance: 1.5X per year historically, now 1.1X per year

Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp

The full stack drives the gains: APPLICATIONS · SYSTEMS · ALGORITHMS · CUDA · ARCHITECTURE
ELEVEN YEARS OF GPU COMPUTING
2006: CUDA launched
2008: World's first GPU Top500 system
2010: Fermi, the world's first HPC GPU
2012: Oak Ridge deploys world's fastest supercomputer with GPUs · AlexNet beats expert code by a huge margin using GPUs · Discovered how H1N1 mutates to resist drugs
2014: Stanford builds AI machine using GPUs
2017: Top 13 greenest supercomputers powered by NVIDIA GPUs

Further milestones along the way: world's first atomic model of the HIV capsid · world's first 3-D mapping of the human genome · Google outperforms humans in ImageNet · GPU-trained AI machine beats the world champion in Go
TESLA PLATFORM
World's Leading Data Center Platform for Accelerating HPC and AI

APPLICATIONS: internet services · enterprise applications (manufacturing, automotive, healthcare, finance, retail, defense, …) · HPC (450+ applications)
INDUSTRY FRAMEWORKS & TOOLS: deep learning frameworks · ecosystem tools
NVIDIA SDK: Deep Learning SDK (cuDNN, TensorRT, NCCL, DeepStream SDK, cuBLAS, cuSPARSE) · ComputeWorks (CUDA C/C++, Fortran)
TESLA GPUs & SYSTEMS: Tesla GPU · NVIDIA DGX-1 · NVIDIA HGX-1 · system OEMs · cloud
500+ GPU-ACCELERATED APPLICATIONS
All Top 15 HPC Apps Accelerated
VASP
AMBER
NAMD
GROMACS
Gaussian
Simulia Abaqus
WRF
OpenFOAM
ANSYS
LS-DYNA
BLAST
LAMMPS
ANSYS Fluent
Quantum Espresso
GAMESS
14X growth in GPU developers: 45,000 in 2012 to 615,000 in 2017
DEFINING THE NEXT GIANT WAVE IN HPC
Oak Ridge Summit: US's next fastest supercomputer, 200+ petaflops of HPC and 3+ exaflops of AI
ABCI (AIST): Japan's fastest AI supercomputer
Piz Daint: Europe's fastest supercomputer
MOST ADOPTED PLATFORM FOR ACCELERATING HPC
EVERY DEEP LEARNING FRAMEWORK ACCELERATED
25X growth in companies engaged: 1,500 in 2014 to 39,637 in 2017
AVAILABLE EVERYWHERE
Cloud Services
Systems
Desktops
MOST ADOPTED PLATFORM FOR ACCELERATING AI
TESLA PLATFORM FOR HPC
ARCHITECTING MODERN DATACENTERS: BIG INEFFICIENCIES WITH CPU NODES
Single GPU Server 3.5x Faster than the Largest CPU Data Center

[Chart: ns/day vs. number of CPUs, AMBER simulation of CRISPR, nature's tool for genome editing]
1 node with 4x V100 GPUs outperforms 48 CPU nodes of the Comet supercomputer

AMBER 16 pre-release; CRISPR based on PDB ID 5f9r, 336,898 atoms. CPU: dual-socket Intel E5-2680v3, 12 cores, 128 GB DDR4 per node, FDR IB
WEAK NODES: lots of nodes interconnected with vast network overhead
STRONG NODES: a few lightning-fast nodes, each delivering the performance of hundreds of weak nodes
[Diagram: network fabric connecting server racks]
ARCHITECTING MODERN DATACENTERS
Strong Core CPU for Sequential code
Volta 5,120 CUDA Cores
125 TFLOPS Tensor Core
NVLink for Strong Scaling
70% OF THE WORLD'S SUPERCOMPUTING WORKLOAD ACCELERATED
Intersect360 Research, Nov 2017 “HPC Application Support for GPU Computing”
VASP
AMBER
NAMD
GROMACS
Gaussian
Simulia Abaqus
WRF
OpenFOAM
ANSYS
LS-DYNA
BLAST
LAMMPS
ANSYS Fluent
Quantum Espresso
GAMESS
500+ Accelerated Applications · Top 15 HPC Applications
GPU-ACCELERATED HPC APPLICATIONS: 500+ APPLICATIONS

MFG, CAD & CAE (111 apps), including ANSYS Fluent, Abaqus SIMULIA, AutoCAD, CST Studio Suite
LIFE SCIENCES (50+ apps), including Gaussian, VASP, AMBER, HOOMD-Blue, GAMESS
DATA SCIENCE & ANALYTICS (23 apps), including MapD, Kinetica, Graphistry
DEEP LEARNING (32 apps), including Caffe2, MXNet, TensorFlow
MEDIA & ENTERTAINMENT (142 apps), including DaVinci Resolve, Premiere Pro CC, Redshift Renderer
PHYSICS (20 apps), including QUDA, MILC, GTC-P
OIL & GAS (17 apps), including RTM, SPECFEM3D
SAFETY & SECURITY (15 apps), including Cylance, FaceControl, Syndex Pro
TOOLS & MANAGEMENT (15 apps), including Bright Cluster Manager, HPCToolkit, Vampir
FEDERAL & DEFENSE (13 apps), including ArcGIS Pro, ENVI, SocetGXP
CLIMATE & WEATHER (4 apps), including COSMO, GALES, WRF
COMPUTATIONAL FINANCE (16 apps), including O-Quant Options Pricing, MUREX, MISYS
DEEP LEARNING COMES TO HPC

SIMULATION (FP64/FP32) feeds TRAINING (FP32/FP16), which feeds INFERENCE (FP16/INT8) and REGRESSION TESTING (FP16/INT8)
Data flow: training set, regression set, then new data; errors found on new data feed back into training
AI ACCELERATES SCIENTIFIC DISCOVERY

UIUC & NCSA, astrophysics: 5,000X faster LIGO signal processing
U. Florida & UNC, drug discovery: 300,000X faster molecular energetics prediction
SLAC, astrophysics: gravitational lensing analysis cut from weeks to 10 ms
U.S. DoE, particle physics: 33% more accurate neutrino detection
Princeton & ITER, clean energy: 50% higher accuracy for fusion sustainment
U. Pitt, drug discovery: 35% higher accuracy for protein scoring
ONE PLATFORM BUILT FOR BOTH DATA SCIENCE & COMPUTATIONAL SCIENCE
Tesla Platform: accelerating AI and accelerating HPC, both built on CUDA
DRAMATICALLY MORE FOR YOUR MONEY
Equal Throughput with Fewer Racks; a Smaller, More Efficient Budget

[Chart: number of racks (~30 kW per rack) needed for equal throughput]
VASP: 1 rack of 36 CPUs + 72 V100s ($0.8M) vs. 5 racks with 360 CPUs ($2.0M)
RTM: 1 GPU-accelerated rack vs. 14 racks with 1152 CPUs ($6.0M)
ResNet-50 (DL training): 1 GPU-accelerated rack vs. 22 racks with 1764 CPUs ($9.2M)
Save up to $8M with each GPU-accelerated rack

Budget shift: in the CPU-only build, compute servers are 39% of cost and non-compute (rack, cabling, infrastructure, networking) is 61%; in the accelerated build, compute servers are 85% and non-compute is 15%
Source: data center cost model by Microsoft Research
DATA CENTER SAVINGS FOR MIXED WORKLOADS
5X Better HPC TCO for the Same Throughput

Mixed workload: materials science (VASP), life sciences (AMBER), physics (MILC), deep learning (ResNet-50)
160 self-hosted servers at 96 kW vs. 12 accelerated servers with 4x V100 GPUs each at 20 kW
Same throughput at 1/3 the cost, 1/4 the space, and 1/5 the power
TESLA V100
The Fastest and Most Productive GPU for AI and HPC

Volta architecture: most productive GPU
Tensor Core: 125 programmable TFLOPS for deep learning
Improved SIMT model: enables new algorithms
Volta MPS: better inference utilization
Improved NVLink & HBM2: efficient bandwidth
VOLTA TO FUEL SUMMIT: Next Milestone in AI Supercomputing

Performance leadership: 10X perf over Titan (20 PF to 200 PF)
AI exascale today: 3+ EFLOPS of tensor ops
Accelerated science: 5-10X application perf over Titan for ACME, DIRAC, FLASH, GTC, HACC, LSDALTON, NAMD, NUCCOR, NWCHEM, QMCPACK, RAPTOR, SPECFEM, and XGC
BREAKTHROUGH EFFICIENCY ON THE PATH TO EXASCALE

[Chart: GFLOPS per watt of top GPU systems in the Green500 list, measured, with NVIDIA projections for V100; exascale goal: 33 GF/W]
Eurotech Aurora (K20): 3.2 · Tsubame-KFC (K20X): 4.4 · Tsubame-KFC (K80): 5.3 · SaturnV (P100): 9.5 · Tsubame 3 (P100): 14.1 · V100: ahead of the curve

13/13 greenest supercomputers powered by Tesla P100: TSUBAME 3.0, Kukai, AIST AI Cloud, RAIDEN GPU subsystem, Piz Daint, Wilkes-2, GOSAT-2 (RCF2), DGX SaturnV, Reedbush-H, JADE, Facebook Cluster, Cedar, DAVIDE
POWER OF THE GPU COMPUTING PLATFORM
Delivered Value Grows Over Time

[Chart: AMBER performance in ns/day, Cellulose NVE dataset]
K20 (2013) with AMBER 12/CUDA 4 · K40 (2014) with AMBER 14/CUDA 4 · K80 (2015) with AMBER 14/CUDA 6 · P100 (2016) with AMBER 16/CUDA 8 · V100 (2017) with AMBER 16/CUDA 9

[Chart: GoogLeNet training performance in images/sec, ImageNet dataset]
8x K80 (2014) with cuDNN 2/CUDA 6 · 8x Maxwell (2015) with cuDNN 4/CUDA 7 · DGX-1 (2016) with cuDNN 6/CUDA 8/NCCL 1.6 · DGX-1V (2017) with cuDNN 7/CUDA 9/NCCL 2
TESLA PLATFORM FOR AI
AI REVOLUTIONIZING OUR WORLD
Internet services: search, assistants, translation, recommendations, shopping, photos…
Healthcare: detect, diagnose, and treat diseases
Industry: powering breakthroughs in agriculture, manufacturing, and EDA
NEURAL NETWORK COMPLEXITY IS EXPLODING
Bigger and More Compute Intensive

Image models (GOP x bandwidth, 2011-2017): AlexNet, GoogleNet, Inception-v2, ResNet-50, Inception-v4; 350X growth
Speech models (GOP x bandwidth, 2013-2018): DeepSpeech, DeepSpeech 2, DeepSpeech 3; 30X growth
Translation models (GOP x bandwidth, 2014-2018): GNMT, OpenNMT, MoE; 10X growth
PLATFORM BUILT FOR AI
Delivering 125 TFLOPS of DL Performance with Volta

Volta Tensor Core: a 4x4 matrix processing array computing D[FP32] = A[FP16] * B[FP16] + C[FP32], optimized for deep learning
Volta-optimized cuDNN: dense matrices of Tensor Core compute, converting FP32 data to tensor-op data for the frameworks
Supported by all major frameworks
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
Training, then datacenter inference, then deployment on devices

TRAINING: billions of trillions of operations; GPUs train larger models and accelerate time to market
REVOLUTIONARY AI PERFORMANCE

3X faster DL training, a 3X reduction in time to train over P100 (neural machine translation with LSTMs, 13 epochs, German to English, WMT15 subset): 2x CPU (Xeon E5-2699 v4), 15 days · 1x P100, 18 hours · 1x V100, 6 hours

Over 80X DL training performance in 3 years (GoogLeNet training speedup vs. 1x K80 with cuDNN 2): 1x K80/cuDNN 2 (Q1 '15) · 4x M40/cuDNN 3 (Q3 '15) · 8x P100/cuDNN 6 (Q2 '16) · 8x V100/cuDNN 7 (Q2 '17)
NVIDIA GPUS POWER WORLD'S FASTEST DEEP LEARNING PERFORMANCE

Time to train ResNet-50 (ImageNet, 90 epochs):
Facebook, June '17: 256 Tesla P100, 60 minutes
IBM, Aug '17: 256 Tesla P100, 48 minutes
Preferred Networks, Nov '17: 1024 Tesla P100, 15 minutes
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
Training, then datacenter inference, then deployment on devices

DATACENTER INFERENCING: tens of billions of image, voice, and video queries per day; GPU inference gives fast response and maximizes datacenter throughput
NVIDIA TENSORRT: PROGRAMMABLE INFERENCE ACCELERATOR
One optimizer and runtime across the platform: Tesla V100, Tesla P4, DRIVE PX 2, Jetson TX2, and NVIDIA DLA
NVIDIA TENSORRT 3: WORLD'S FASTEST INFERENCE PLATFORM

ResNet-50 throughput (images/sec at a 7 ms target latency): CPU + TensorFlow at 14 ms latency; V100 + TensorFlow at 7 ms; V100 + TensorRT at 7 ms with the highest throughput
OpenNMT throughput (sentences/sec at a 200 ms target latency): CPU + Torch at 280 ms latency; V100 + Torch at 153 ms; V100 + TensorRT at 117 ms with the highest throughput
NVIDIA PLATFORM SAVES DATA CENTER COSTS
Game-Changing Inference Performance

Inference workload: image recognition using ResNet-50 at 45,000 images/sec
160 CPU servers at 65 kW vs. 1 HGX server at 3 kW: same throughput in 1/4 the space at 1/22 the power
GPU-ACCELERATED INFERENCE
iFLYTEK: speech recognition
Valossa: video intelligence
Microsoft Bing: visual search
TESLA PRODUCT FAMILY
END-TO-END PRODUCT FAMILY

HYPERSCALE HPC (deep learning training & inference): Tesla V100 for training & inference · Tesla P4 for the most efficient inference & transcoding
STRONG-SCALE HPC (HPC and DL workloads scaling to multiple GPUs): Tesla V100 with NVLink
MIXED-APPS HPC (HPC workloads with a mix of CPU and GPU code): Tesla V100 with PCIe
FULLY INTEGRATED SUPERCOMPUTER: DGX-1 server and DGX Station, fully integrated deep learning solutions
OPTIMIZED FOR DATACENTER EFFICIENCY
30% More Performance in a Rack; 75% of the Performance at Half the Power

V100 @ MAXP (max performance): 13 kW rack, 4 nodes of 8x V100, 1X ResNet-50 rack throughput
V100 @ MAXQ (max efficiency): 13 kW rack, 7 nodes of 8x V100, 1.3X ResNet-50 rack throughput

[Chart: ResNet-50 training DL perf and DL perf/watt vs. GPU power in watts, marking the max-performance and max-efficiency operating points]
TESLA V100

              For NVLink Servers                    For PCIe Servers
Core          5120 CUDA cores, 640 Tensor Cores     5120 CUDA cores, 640 Tensor Cores
Compute       7.8 TF DP · 15.7 TF SP · 125 TF DL    7 TF DP · 14 TF SP · 112 TF DL
Memory        HBM2: 900 GB/s, 16 GB                 HBM2: 900 GB/s, 16 GB
Interconnect  NVLink (up to 300 GB/s) +             PCIe Gen3 (up to 32 GB/s)
              PCIe Gen3 (up to 32 GB/s)
Power         300 W                                 250 W
Available     Now                                   Now
TESLA PLATFORM FOR CLOUD PROVIDERS
CLOUD GPU DEMAND OUTSTRIPS SUPPLY

AWS launches P2 instances (Q3 2016): "P2 instance is one of the fastest growing instances in AWS history." (Andrew Jassy, AWS CEO, re:Invent 2016)
Azure launches N-Series preview (Q4 2016): "We've had thousands of customers participate in the N-Series preview since we launched it back in August." (Corey Sanders, Director of Compute, Azure)
GLOBAL CSP OFFERINGS

Compute
- AWS: P3 with up to 8x V100 SXM2 (N. Virginia, Oregon, Ireland, and Tokyo only), https://aws.amazon.com/ec2/instance-types/p3/ · P2 with up to 8x K80 physical cards, https://aws.amazon.com/ec2/instance-types/p2/
- Google Cloud: GPU servers with up to 4x K80 · up to 4x P100 PCIe in public beta, https://cloud.google.com/gpu/
- IBM Cloud: GPU servers with up to 2x K80 or 1x P100 PCIe (bare metal), https://www.ibm.com/cloud-computing/bluemix/gpu-computing
- Azure: NC series with up to 2x K80 · NCv2 & ND series with up to 4x P100 PCIe or 4x P40 (US West 2 region only), https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/#n-series
- Oracle: X7 shape with up to 2x P100 (bare metal and VM), Ashburn region only with Frankfurt to come in Jan 2018, https://cloud.oracle.com/infrastructure/compute

Virtual workstation
- AWS: G3 with M60
- Google Cloud: GPU server with P100 PCIe, vWS private alpha available now and public beta in Jan '18
- IBM Cloud: GPU server with up to 2x M60 or 2x M10
- Azure: GPU server with M60

Virtual PC
- AWS: GPU server with up to 4x K520 physical cards
- IBM Cloud: GPU servers with M60 and M10, VMware Horizon Air vPC launching in January
NVIDIA GPU CLOUD
AI and HPC Everywhere, for Everyone

NVIDIA GPU Cloud integrates GPU-optimized deep learning frameworks, HPC apps, runtimes, libraries, and OS into a ready-to-run container, available at no charge.

Innovate in minutes, not weeks: removes all the DIY complexity of DL and HPC software integration
Cross-platform: containers run locally on DGX systems and TITAN PCs, or on cloud service provider GPU instances
Always up to date: monthly updates by NVIDIA ensure maximum performance
NVIDIA GPU CLOUD: SIMPLIFYING AI & HPC
DEEP LEARNING HPC APPS HPC VIZ
NGC GPU-OPTIMIZED DEEP LEARNING CONTAINERS
NVCaffe
Caffe2
Microsoft Cognitive Toolkit (CNTK)
DIGITS
MXNet
PyTorch
TensorFlow
Theano
Torch
CUDA (base level container for developers)
NEW! – NVIDIA TensorRT inference accelerator with ONNX support
A Comprehensive Catalog of Deep Learning Software
HPC APPS COMING TO NVIDIA GPU CLOUD
NVIDIA GPU CLOUD FOR HPC VISUALIZATION
Unified Visualization for Large Data Sets

ParaView with NVIDIA OptiX · ParaView with NVIDIA Holodeck · ParaView with NVIDIA IndeX
Large-scale volumetric rendering · physically accurate ray tracing · production-quality images · seamless integration with ParaView

Early access available now: sign up at nvidia.com/gpu-cloud
TESLA PLATFORM FOR DEVELOPERS
HOW GPU ACCELERATION WORKS

Application code splits across the two processors: the roughly 5% of code containing the compute-intensive functions runs on the GPU, while the rest of the sequential code stays on the CPU.
GPU-ACCELERATED LIBRARIES
"Drop-in" Acceleration for Your Applications

Deep learning: cuDNN · TensorRT · DeepStream SDK
Linear algebra: cuBLAS · cuSPARSE · cuSOLVER · CUDA math library
Signal, image & video: cuFFT · NVIDIA NPP · Codec SDK
Parallel algorithms & communication: nvGRAPH · NCCL · cuRAND
CUDA TOOLKIT 9

UNLEASHES THE POWER OF VOLTA: optimized for Tensor Cores, second-generation NVLink, and HBM2 stacked memory
COOPERATIVE THREAD GROUPS: flexible thread groups and efficient parallel algorithms; synchronize across thread blocks within a single GPU or across multiple GPUs
FASTER LIBRARIES: GEMM optimizations for RNNs (cuBLAS) · >20x faster image processing (NPP) · FFT optimizations across various sizes (cuFFT)
DEVELOPER TOOLS & PLATFORM UPDATES: 1.3x faster compiling · new OS and compiler support · unified memory profiling · NVLink visualization
WHAT IS OPENACC

OpenACC is a directives-based programming approach to parallel computing, designed for performance and portability on CPUs and accelerators for HPC (OpenPOWER, Sunway, x86 CPU & Xeon Phi, NVIDIA GPU, PEZY-SC).

Add a simple compiler directive:

```c
main()
{
    <serial code>
    #pragma acc kernels
    {
        <parallel code>
    }
}
```

Read more at www.openacc.org
![Page 54: Tesla Master Deck - HPC Saudi 2019 | Saudi HPC 2019 · ARCHITECTING MODERN DATACENTERS. BIG INEFFICIENCIES WITH CPU NODES . Single GPU Server 3.5x Faster than the Largest CPU Data](https://reader036.vdocuments.us/reader036/viewer/2022063000/5f10a0cc7e708231d44a0a92/html5/thumbnails/54.jpg)
54
OPENACC: EASY ONBOARD TO GPU COMPUTING
A Widely Adopted Directives Model for Parallel Programming

SIMPLE. POWERFUL. PORTABLE.

Targets: POWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU, AMD, PEZY-SC

[Chart: AWE Hydrodynamics CloverLeaf mini-app (bm32 data set), speedup vs. a single Haswell core. Multicore Broadwell and multicore POWER8 (PGI OpenACC and Intel/IBM OpenMP) reach roughly 10-11x; PGI OpenACC on 1, 2, and 4 Volta V100 GPUs reaches 77x, 120x, and 158x.]

ADOPTED BY KEY HPC CODES
• 5 CAAR codes: GTC, XGC, ACME, FLASH, LSDalton
• 3 of top 5 HPC apps: ANSYS Fluent, VASP, Gaussian
• 2017 Gordon Bell finalist: CAM-SE on TaihuLight
![Page 55: Tesla Master Deck - HPC Saudi 2019 | Saudi HPC 2019 · ARCHITECTING MODERN DATACENTERS. BIG INEFFICIENCIES WITH CPU NODES . Single GPU Server 3.5x Faster than the Largest CPU Data](https://reader036.vdocuments.us/reader036/viewer/2022063000/5f10a0cc7e708231d44a0a92/html5/thumbnails/55.jpg)
55
• LSDalton (Quantum Chemistry): 12x speedup in 1 week
• Numeca (CFD): 10x faster kernels, 2x faster app
• PowerGrid (Medical Imaging): 40 days to 2 hours
• INCOMP3D (CFD): 3x speedup
• NekCEM (Computational Electromagnetics): 2.5x speedup, 60% less energy
• COSMO (Climate & Weather): 40x speedup, 3x energy efficiency
• CloverLeaf (CFD): 4x speedup, single CPU/GPU code
• MAESTRO & CASTRO (Astrophysics): 4.4x speedup with 4 weeks of effort
![Page 56: Tesla Master Deck - HPC Saudi 2019 | Saudi HPC 2019 · ARCHITECTING MODERN DATACENTERS. BIG INEFFICIENCIES WITH CPU NODES . Single GPU Server 3.5x Faster than the Largest CPU Data](https://reader036.vdocuments.us/reader036/viewer/2022063000/5f10a0cc7e708231d44a0a92/html5/thumbnails/56.jpg)
56
OPENACC RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow

• Resources: https://www.openacc.org/resources
• Success Stories: https://www.openacc.org/success-stories
• Events: https://www.openacc.org/events
• Compilers and Tools: https://www.openacc.org/tools (free compilers available)
![Page 57: Tesla Master Deck - HPC Saudi 2019 | Saudi HPC 2019 · ARCHITECTING MODERN DATACENTERS. BIG INEFFICIENCIES WITH CPU NODES . Single GPU Server 3.5x Faster than the Largest CPU Data](https://reader036.vdocuments.us/reader036/viewer/2022063000/5f10a0cc7e708231d44a0a92/html5/thumbnails/57.jpg)
57
NVIDIA DEEP LEARNING SDK
High-performance GPU acceleration for deep learning

Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications:
• High-performance building blocks for training and deploying deep neural networks on NVIDIA GPUs
• Industry-vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks
• Multi-GPU and multi-node scaling that accelerates training on up to eight GPUs

"We are amazed by the steady stream of improvements made to the NVIDIA Deep Learning SDK and the speedups that they deliver."
— Frédéric Bastien, Team Lead (Theano), MILA

developer.nvidia.com/deep-learning-software
![Page 58: Tesla Master Deck - HPC Saudi 2019 | Saudi HPC 2019 · ARCHITECTING MODERN DATACENTERS. BIG INEFFICIENCIES WITH CPU NODES . Single GPU Server 3.5x Faster than the Largest CPU Data](https://reader036.vdocuments.us/reader036/viewer/2022063000/5f10a0cc7e708231d44a0a92/html5/thumbnails/58.jpg)
58
NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL)
Multi-GPU and multi-node collective communication primitives

developer.nvidia.com/nccl

• High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
• Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
• Easy to integrate and MPI-compatible; uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
• Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch, and more

Automatic topology detection: multi-GPU over NVLink and PCIe; multi-node over InfiniBand verbs and IP sockets
[Chart: near-linear multi-node scaling. Microsoft Cognitive Toolkit multi-node scaling performance (images/sec) with NCCL 2, NVIDIA DGX-1 + cuDNN 6 (FP32), ResNet-50, batch size 64: 216.9, 843.5, 1684.8, 3281.1, and 6569.6 images/sec, scaling up to 32 GPUs.]
![Page 59: Tesla Master Deck - HPC Saudi 2019 | Saudi HPC 2019 · ARCHITECTING MODERN DATACENTERS. BIG INEFFICIENCIES WITH CPU NODES . Single GPU Server 3.5x Faster than the Largest CPU Data](https://reader036.vdocuments.us/reader036/viewer/2022063000/5f10a0cc7e708231d44a0a92/html5/thumbnails/59.jpg)
59
NVIDIA DIGITS
Interactive Deep Learning GPU Training System

developer.nvidia.com/digits

• Interactive deep learning training application for engineers and data scientists
• Simplifies deep neural network training with an interactive interface to train, validate, and visualize results
• Built-in workflows for image classification, object detection, and image segmentation
• Improves model accuracy with pre-trained models from the DIGITS Model Store
• Faster time to solution with multi-GPU acceleration
![Page 60: Tesla Master Deck - HPC Saudi 2019 | Saudi HPC 2019 · ARCHITECTING MODERN DATACENTERS. BIG INEFFICIENCIES WITH CPU NODES . Single GPU Server 3.5x Faster than the Largest CPU Data](https://reader036.vdocuments.us/reader036/viewer/2022063000/5f10a0cc7e708231d44a0a92/html5/thumbnails/60.jpg)
60
NVIDIA cuDNN
Deep Learning Primitives

developer.nvidia.com/cudnn

• High-performance building blocks for deep learning frameworks
• Drop-in acceleration for widely used deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, PyTorch, TensorFlow, Theano, and others
• Accelerates industry-vetted deep learning algorithms such as convolutions, LSTM RNNs, fully connected, and pooling layers
• Fast deep learning training performance tuned for NVIDIA GPUs

"NVIDIA has improved the speed of cuDNN with each release while extending the interface to more operations and devices at the same time."
— Evan Shelhamer, Lead Caffe Developer, UC Berkeley
[Chart: deep learning training performance (images/sec, 0 to 12,000) rising across hardware and library generations, from cuDNN 2 on 8x K80, to cuDNN 4 on 8x Maxwell, to cuDNN 6 + NCCL 1.6 on DGX-1, to cuDNN 7 + NCCL 2 on DGX-1V.]
![Page 61: Tesla Master Deck - HPC Saudi 2019 | Saudi HPC 2019 · ARCHITECTING MODERN DATACENTERS. BIG INEFFICIENCIES WITH CPU NODES . Single GPU Server 3.5x Faster than the Largest CPU Data](https://reader036.vdocuments.us/reader036/viewer/2022063000/5f10a0cc7e708231d44a0a92/html5/thumbnails/61.jpg)
61
NVIDIA TensorRT 3
Programmable Inference Accelerator

A compiler for optimized neural networks: TensorRT takes a trained neural network and produces a compiled, optimized network through:
• Weight & activation precision calibration
• Layer & tensor fusion
• Kernel auto-tuning
• Dynamic tensor memory
• Multi-stream execution
![Page 62: Tesla Master Deck - HPC Saudi 2019 | Saudi HPC 2019 · ARCHITECTING MODERN DATACENTERS. BIG INEFFICIENCIES WITH CPU NODES . Single GPU Server 3.5x Faster than the Largest CPU Data](https://reader036.vdocuments.us/reader036/viewer/2022063000/5f10a0cc7e708231d44a0a92/html5/thumbnails/62.jpg)