![Page 1: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/1.jpg)
HPC AND AI ACCELERATION ON GPU
Yi Cheng (易成), SA HPC & AI, March 2019
![Page 2: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/2.jpg)
NVIDIA: “THE AI COMPUTING COMPANY”
Artificial Intelligence | Computer Graphics | GPU Computing
![Page 3: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/3.jpg)
RISE OF GPU COMPUTING
[Chart: relative performance (log scale, 10^2 to 10^7) versus year, 1980 to 2020]
• GPU-computing performance: 1.5X per year, 1000X by 2025
• Single-threaded performance: 1.5X per year, later slowing to 1.1X per year
• Layers shown on the GPU curve: Applications, Systems, Algorithms, CUDA, Architecture
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.
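The growth rates on this slide compound, so the relationship between "1.5X per year" and "1000X" can be checked with a logarithm. The sketch below is only an arithmetic illustration; the function names are hypothetical and the slide does not state which start year it assumes.

```python
import math

def years_to_reach(growth_per_year, target_multiple):
    """Years of compounding at growth_per_year needed to reach target_multiple."""
    return math.log(target_multiple) / math.log(growth_per_year)

def multiple_after(growth_per_year, years):
    """Total speedup after compounding growth_per_year for the given number of years."""
    return growth_per_year ** years

if __name__ == "__main__":
    # Rates taken from the slide: 1.5X per year (GPU), 1.1X per year (single-threaded).
    print(f"1.5X per year reaches 1000X in {years_to_reach(1.5, 1000):.1f} years")
    print(f"1.1X per year reaches 1000X in {years_to_reach(1.1, 1000):.1f} years")
```

At 1.5X per year, 1000X takes roughly 17 years, which is consistent with a curve that starts in the late 2000s and ends at 2025. At 1.1X per year the same gain would take more than 70 years.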
![Page 4: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/4.jpg)
ELEVEN YEARS OF GPU COMPUTING
[Timeline with year markers 2006, 2008, 2010, 2012, 2014, 2017; events in approximate chronological order]
• 2006: CUDA Launched
• 2008: World’s First GPU Top500 System
• 2010: Fermi: World’s First HPC GPU
• Discovered How H1N1 Mutates to Resist Drugs
• AlexNet beats expert code by huge margin using GPUs
• Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs
• Stanford Builds AI Machine using GPUs
• World’s First Atomic Model of HIV Capsid
• World’s First 3-D Mapping of Human Genome
• Google Outperforms Humans in ImageNet
• GPU-Trained AI Machine Beats World Champion in Go
• 2017: Top 13 Greenest Supercomputers Powered by NVIDIA GPUs
![Page 5: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/5.jpg)
GPU EVOLUTION
[Chart: SGEMM / W (normalized), scale 0 to 20, versus year, 2008 to 2016]
• Tesla (CUDA): C870, C1060, C1070
• Fermi (FP64): C2050, C2070, M2070, M2090
• Kepler (Dynamic Parallelism): K20, K40, K80
• Maxwell (DX12): M10, M40, M60
• Pascal (Unified Memory, 3D Memory, NVLink): P4, P40, P100
![Page 6: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/6.jpg)
NVIDIA POWERS WORLD’S FASTEST SUPERCOMPUTERS
48% More Systems | 22 of Top 25 Greenest
• ORNL Summit (World’s Fastest): 27,648 GPUs | 144 PF
• LLNL Sierra (World’s 2nd Fastest): 17,280 GPUs | 95 PF
• Piz Daint (Europe’s Fastest): 5,704 GPUs | 21 PF
• ABCI (Japan’s Fastest): 4,352 GPUs | 20 PF
• ENI HPC4 (Fastest Industrial): 3,200 GPUs | 12 PF
![Page 7: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/7.jpg)
NVIDIA POWERS GORDON BELL WINNERS & 5 OF 6 FINALISTS
GPU Acceleration Critical To HPC At Scale Today
• Material Science: 300X Higher Performance
• Genomics: 2.36 ExaFLOPS
• Seismic: 1st Soil & Structure Simulation
• Quantum Chromodynamics: <1% of Uncertainty Margin
• Weather: 1.15 ExaFLOPS
![Page 8: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/8.jpg)
END-TO-END PRODUCT FAMILY
HPC / TRAINING | INFERENCE
• DESKTOP: TITAN/GeForce
• WORKSTATION: DGX Station
• DATA CENTER: Tesla V100
• AUTOMOTIVE: Drive AGX Pegasus
• VIRTUAL WORKSTATION: Virtual GPU
• SERVER PLATFORM: HGX1/HGX2
• EMBEDDED: Jetson AGX Xavier
• DATA CENTER: Tesla V100, Tesla P4/T4
• FULLY INTEGRATED AI SYSTEMS: DGX-1, DGX-2
![Page 9: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/9.jpg)
TESLA PRODUCT FAMILY
TESLA V100 (Scale-up): Supercomputing | DL Training & Inference | Machine Learning | Video | Graphics
• V100 SXM2 with NVLink
• V100 PCIe, 2 slot
• HGX-2 Baseboard: 16 V100 + NVSwitch (V100 & NVSwitch heat sink included but not shown)
TESLA T4 (Scale-out): DL Inference & Training | Machine Learning | Video | Graphics
• T4 PCIe: Low Profile, 70W
![Page 10: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/10.jpg)
NVIDIA UNIVERSAL ACCELERATION PLATFORM
Single Platform Drives Utilization and Productivity
CUSTOMER USE CASES: CONSUMER INTERNET, INDUSTRIAL, and SCIENTIFIC APPLICATIONS
• Speech, Translate, Recommender, Video | Images, Healthcare, Manufacturing, Finance, Retail, Molecular Simulations, Weather Forecasting, Seismic Mapping
APPS & FRAMEWORKS
• +580 Applications, including Amber and NAMD
NVIDIA SDK & LIBRARIES (built on CUDA)
• MACHINE LEARNING / ANALYTICS: cuDF, cuML, cuGRAPH
• DEEP LEARNING: cuDNN, cuBLAS, CUTLASS, NCCL, TensorRT
• HPC: cuBLAS, cuFFT, OpenACC
TESLA GPUs & SYSTEMS
• HGX-2: Scale up, Dense Compute
• T4: Scale out, Distributed Compute
![Page 11: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/11.jpg)
TESLA V100

| | V100 for NVLink Servers | V100 for PCIe Servers |
|---|---|---|
| Core | 5120 CUDA cores, 640 Tensor cores | 5120 CUDA cores, 640 Tensor cores |
| Compute | 7.8 TF DP ∙ 15.7 TF SP ∙ 125 TF DL | 7 TF DP ∙ 14 TF SP ∙ 112 TF DL |
| Memory | HBM2: 900 GB/s ∙ 32 GB | HBM2: 900 GB/s ∙ 32 GB |
| Interconnect | NVLink (up to 300 GB/s) + PCIe Gen3 (up to 32 GB/s) | PCIe Gen3 (up to 32 GB/s) |
| Power | 300W | 250W |
| Available | Now | Now |
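The peak compute figures for the NVLink version of V100 follow from unit counts and clock rate. The sketch below reproduces them under stated assumptions: the 1.53 GHz boost clock is not given on the slide, the 80 SMs and 32 FP64 units per SM come from the GV100 slides later in the deck, and each Tensor Core is assumed to complete one 4x4x4 matrix multiply-accumulate (64 FMAs) per clock.

```python
# Peak throughput = units x FLOPs per unit per clock x clock rate.
BOOST_CLOCK_GHZ = 1.53   # assumption: V100 SXM2 boost clock, not stated on the slide
SM_COUNT = 80            # from the GV100 slide

def peak_tflops(units, flops_per_unit_per_clock, clock_ghz=BOOST_CLOCK_GHZ):
    """Return peak TFLOPS for `units` execution units running at `clock_ghz`."""
    return units * flops_per_unit_per_clock * clock_ghz / 1000.0

# One fused multiply-add (FMA) counts as 2 floating-point operations.
fp32 = peak_tflops(5120, 2)             # 5120 CUDA cores
fp64 = peak_tflops(SM_COUNT * 32, 2)    # 32 FP64 units per SM
# Assumed: one 4x4x4 matrix FMA per Tensor Core per clock = 64 FMAs = 128 FLOPs.
tensor = peak_tflops(640, 4 * 4 * 4 * 2)

if __name__ == "__main__":
    print(f"FP32 {fp32:.1f} TF, FP64 {fp64:.1f} TF, Tensor {tensor:.1f} TF")
```

With these inputs the results land on the slide's 15.7 TF SP and 7.8 TF DP, and slightly above its rounded 125 TF DL.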
![Page 12: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/12.jpg)
TESLA V100
The Fastest and Most Productive GPU for AI and HPC
• Volta Architecture: Most Productive GPU
• Tensor Core: 125 Programmable TFLOPS Deep Learning
• Improved SIMT Model: New Algorithms
• Volta MPS: Inference Utilization
• Improved NVLink 2.0 (300 GB/s) & HBM2 (900 GB/s): Efficient Bandwidth
![Page 13: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/13.jpg)
INTRODUCING TESLA P100
New GPU Architecture to Enable the World’s Fastest Compute Node
• Pascal Architecture: Highest Compute Performance
• NVLink (160 GB/s): GPU Interconnect for Maximum Scalability
• CoWoS HBM2 (768 GB/s): Unifying Compute & Memory in Single Package
• Page Migration Engine: Simple Parallel Programming with Virtually Unlimited Memory
[Diagrams: two CPUs, each behind a PCIe switch, attached to Tesla P100 GPUs; Unified Memory spanning CPU and Tesla P100]
![Page 14: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/14.jpg)
GPU PERFORMANCE COMPARISON

| | P100 | V100 | Ratio |
|---|---|---|---|
| Training acceleration | 10 TFLOPS | 125 TFLOPS | 12x |
| Inference acceleration | 21 TFLOPS | 125 TFLOPS | 6x |
| FP64/FP32 | 5/10 TFLOPS | 7.8/15.7 TFLOPS | 1.5x |
| HBM2 Bandwidth | 720 GB/s | 900 GB/s | 1.2x |
| NVLink Bandwidth | 160 GB/s | 300 GB/s | 1.9x |
| L2 Cache | 4 MB | 6 MB | 1.5x |
| L1 Caches | 1.3 MB | 10 MB | 7.7x |
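The Ratio column is the V100 value divided by the P100 value, rounded for the slide. A minimal sketch that recomputes it from the numbers printed in the comparison (the dictionary layout and function name are illustrative only):

```python
# (P100 value, V100 value) pairs as printed on the comparison slide.
SPECS = {
    "Training acceleration": (10, 125),
    "Inference acceleration": (21, 125),
    "FP64": (5, 7.8),
    "FP32": (10, 15.7),
    "HBM2 Bandwidth": (720, 900),
    "NVLink Bandwidth": (160, 300),
    "L2 Cache": (4, 6),
    "L1 Caches": (1.3, 10),
}

def v100_over_p100(specs=SPECS):
    """Return the unrounded V100/P100 ratio for every row."""
    return {name: v100 / p100 for name, (p100, v100) in specs.items()}

if __name__ == "__main__":
    for name, ratio in v100_over_p100().items():
        print(f"{name:24s} {ratio:5.2f}x")
```

The unrounded values (12.5, about 5.95, 1.25, 1.875, and so on) show that the slide rounds to one or two significant figures.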
![Page 15: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/15.jpg)
VOLTA GV100
• 21B transistors, 815 mm²
• 80 SMs*, 5120 CUDA Cores, 640 Tensor Cores
• 32 GB HBM2, 900 GB/s HBM2
• 300 GB/s NVLink
*full GV100 chip contains 84 SMs
![Page 16: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/16.jpg)
VOLTA GV100 SM

| Per-SM resource | GV100 |
|---|---|
| FP32 units | 64 |
| FP64 units | 32 |
| INT32 units | 64 |
| Tensor Cores | 8 |
| Register File | 256 KB |
| Unified L1/Shared memory | 128 KB |
| Active Threads | 2048 |
![Page 17: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/17.jpg)
GPU P100 SM (Streaming Multiprocessor)

| Resource | GP100 |
|---|---|
| SM/GPU | 56 |
| FP32 units | 64 |
| FP64 units | 32 |
| TensorCore | - |
| Register/SM | 256 KB |
| Shared Memory/SM | 64 KB |
| L1 Cache | 24 KB |
![Page 18: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/18.jpg)
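TENSOR CORE
Mixed Precision Matrix Math, 4x4 matrices

$$D = AB + C$$

$$
\underbrace{D}_{\text{FP16 or FP32}} =
\underbrace{\begin{pmatrix}
A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\
A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\
A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\
A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}}_{\text{FP16}}
\underbrace{\begin{pmatrix}
B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\
B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\
B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\
B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3}
\end{pmatrix}}_{\text{FP16}}
+
\underbrace{\begin{pmatrix}
C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3} \\
C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3} \\
C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3} \\
C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3}
\end{pmatrix}}_{\text{FP16 or FP32}}
$$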
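The mixed-precision operation on this slide can be emulated on a CPU to see what "FP16 inputs, FP32 accumulate" means numerically. This is a rough sketch, not a model of the hardware: rounding the accumulator to FP32 after every addition is an assumption, and the helper names are hypothetical. It uses the standard library's half-precision `struct` format code `e`.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def tensor_core_mma(a, b, c):
    """D = A*B + C on 4x4 lists: A and B rounded to FP16, accumulation in FP32."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values is exact in double precision.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)   # assumed: round after each add
            d[i][j] = acc
    return d

if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    c = [[0.5] * 4 for _ in range(4)]
    for row in tensor_core_mma(identity, b, c):
        print(row)
```

Each output element needs 4 multiply-adds, so a full 4x4 result takes 4 x 4 x 4 = 64 of them.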
![Page 19: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/19.jpg)
BASIC CONCEPTS
VOLTA TRAINING METHOD
![Page 20: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/20.jpg)
USING TENSOR CORES
Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++: Warp-Level Matrix Operations
__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    // The accumulator is zero-filled, so this computes d = a * b and c is not read.
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
![Page 21: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/21.jpg)
VOLTA: A GIANT LEAP FOR DEEP LEARNING
[Bar charts: images per second, P100 vs V100]
• ResNet-50 Training: V100 (Tensor Cores) is 2.4x faster than P100 (FP32)
• ResNet-50 Inference (TensorRT, 7 ms latency): V100 (Tensor Cores) is 3.7x faster than P100 (FP16)
V100 measured on pre-production hardware.
![Page 22: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/22.jpg)
ANNOUNCING TESLA T4
WORLD’S MOST ADVANCED INFERENCE GPU
Universal Inference Acceleration
• 320 Turing Tensor cores
• 2,560 CUDA cores
• 65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
• 16GB | 320GB/s
![Page 23: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/23.jpg)
TURING
Turing: Up to 72 Streaming Multiprocessors (SM)
![Page 24: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/24.jpg)
TURING
Per Streaming Multiprocessor:
• 64 FP32 lanes
• 2 FP64 lanes
• 64 INT32 lanes
• 16 SFU lanes (transcendentals)
• 32 LD/ST lanes (Gmem/Lmem/Smem)
• 8 Tensor Cores
• 1 RT Core (40 RT Cores in total on T4)
• 4 TEX lanes
[Diagram: each SM has its own L1; all SMs share one L2 in front of DRAM]
Up to 72 SMs (T4: 40 SMs)
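The per-SM figures in this deck multiply out to the chip-level core counts quoted on the product slides. A small consistency check, using only numbers that appear in the deck (the table layout and function name are illustrative):

```python
# Per-SM unit counts and SM counts as quoted on the architecture slides.
PER_SM = {
    "V100 (Volta)":  {"sms": 80, "fp32": 64, "fp64": 32, "tensor_cores": 8},
    "T4 (Turing)":   {"sms": 40, "fp32": 64, "fp64": 2,  "tensor_cores": 8},
    "P100 (Pascal)": {"sms": 56, "fp32": 64, "fp64": 32, "tensor_cores": 0},
}

def chip_totals(name):
    """Multiply each per-SM unit count by the number of SMs on the chip."""
    spec = PER_SM[name]
    return {unit: spec["sms"] * count for unit, count in spec.items() if unit != "sms"}

if __name__ == "__main__":
    for gpu in PER_SM:
        print(gpu, chip_totals(gpu))
```

The FP32 totals are the "CUDA cores" figures: 5,120 for V100, 2,560 for T4, and 3,584 for P100. The Tensor Core totals are 640 and 320.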
![Page 25: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/25.jpg)
NEW TURING TENSOR CORE
MULTI-PRECISION FOR AI INFERENCE
65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4
![Page 26: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/26.jpg)
RT CORES
Turing GPU RT Cores Accelerate Ray Tracing
• Hardware accelerated tracing of rays through the scene
• RT Core performance scales up with the Quadro RTX product family
• Applications access capabilities of RT Cores through OptiX, DXR, and Vulkan APIs
![Page 27: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/27.jpg)
TESLA PRODUCTS DECODER

| | K80 | P100 (SXM2) | P100 (PCIE) | P40 | P4 | V100 (PCIE) | V100 (SXM2) | V100 (FHHL) |
|---|---|---|---|---|---|---|---|---|
| GPU | 2x GK210 | GP100 | GP100 | GP102 | GP104 | GV100 | GV100 | GV100 |
| PEAK FP64 (TFLOPs) | 2.9 | 5.3 | 4.7 | NA | NA | 7 | 7.8 | 6.5 |
| PEAK FP32 (TFLOPs) | 8.7 | 10.6 | 9.3 | 12 | 5.5 | 14 | 15.7 | 13 |
| PEAK FP16 (TFLOPs) | NA | 21.2 | 18.7 | NA | NA | 112 | 125 | 105 |
| PEAK TIOPs | NA | NA | NA | 47 | 22 | NA | NA | NA |
| Memory Size | 2x 12 GB GDDR5 | 16 GB HBM2 | 16/12 GB HBM2 | 24 GB GDDR5 | 8 GB GDDR5 | 16 GB HBM2 | 16 GB HBM2 | 16 GB HBM2 |
| Memory BW | 480 GB/s | 732 GB/s | 732/549 GB/s | 346 GB/s | 192 GB/s | 900 GB/s | 900 GB/s | 900 GB/s |
| Interconnect | PCIe Gen3 | NVLINK + PCIe Gen3 | PCIe Gen3 | PCIe Gen3 | PCIe Gen3 | PCIe Gen3 | NVLINK + PCIe Gen3 | PCIe Gen3 |
| ECC | Internal + GDDR5 | Internal + HBM2 | Internal + HBM2 | GDDR5 | GDDR5 | Internal + HBM2 | Internal + HBM2 | Internal + HBM2 |
| Form Factor | PCIE Dual Slot | SXM2 | PCIE Dual Slot | PCIE Dual Slot | PCIE LP | PCIE Dual Slot | SXM2 | PCIE Single Slot Full Height Half Length |
| Power | 300 W | 300 W | 250 W | 250 W | 50-75 W | 250 W | 300 W | 150 W |
![Page 28: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/28.jpg)
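SPECS OVERVIEW

| | Tesla P100 | Tesla P4 | Tesla V100 | Tesla T4 |
|---|---|---|---|---|
| GPU | GP100 | GP104 | GV100 | TU104 |
| CC (compute capability) | 6.0 | 6.1 | 7.0 | 7.5 |
| Mem GB/s | 732 | 192 | 900 | 320 |
| FP32 TFLOPS | 10.0 | 5.5 | 15.5 | 8.1 |
| FP64 TFLOPS | 5.0 | 0.2 | 7.8 | 0.25 |
| FP16 TFLOPS | 20.0 | 0.12 | 31.1 | 16.2 |
| HMMA TFLOPS (Tensor Cores) | - | - | 124.5 | 65 |
| IMMA8 TOPS (Tensor Cores) | - | - | - | 130 |
| IMMA4 TOPS (Tensor Cores) | - | - | - | 260 |
| TDP | 300W | 75W | 300W | 70W |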
![Page 29: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/29.jpg)
TESLA V100 VS P100
Tesla V100 Tensor Cores and CUDA 9 deliver a 9x performance improvement for GEMM operations. (Measured on pre-production Tesla V100 hardware using CUDA 9 software.)
32GB
![Page 30: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/30.jpg)
DGX Computing Platform
![Page 31: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/31.jpg)
DGX PRODUCTS FAMILY
• DGX Station: The Fastest Personal Supercomputer for Researchers and Data Scientists
• DGX-1: The Essential Instrument of AI Research in the Data Center
• DGX-2: The World’s Most Powerful Deep Learning System for the Most Complex Deep Learning Challenges
![Page 32: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/32.jpg)
NVIDIA DGX-1 WITH VOLTA
Highest Performance, Fully Integrated HW System
1 PetaFLOPS | 8x Tesla V100 32GB | 300 GB/s NVLink Hybrid Cube Mesh
2x Xeon | 7 TB RAID 0 | Quad IB/Ethernet 100Gbps, Dual 10GbE | 3U, 3500W
[System diagram callouts: 7 TB SSD | 8x Tesla V100 32 GB | Quad IB/Ethernet 100Gbps, Dual 10GbE | 2x Xeon | 3U, 3200W | NVLink Hybrid Cube Mesh]
![Page 33: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/33.jpg)
VOLTA NVLINK
300 GB/sec
50% more links
28% faster signaling
![Page 34: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/34.jpg)
DRAMATIC BOOST IN ACCURACY WITH LARGER, MORE COMPLEX MODELS
SAP Brand Impact on DGX-1 (32 GB) for Object Detection
• V100 (16 GB): VGG-16 (16 Layers)
• V100 (32 GB): ResNet-152 (152 Layers)
• More Complex Models Now Possible
• Dramatic Boost in Accuracy: 40% Reduced Error Rate
Dataset: Winter Sports 2018 Campaign; high definition resolution images (1920 x 1080)
![Page 35: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/35.jpg)
FASTER RESULTS ON COMPLEX DL AND HPC
Up to 50% Faster Results With 2x The Memory
FASTER RESULTS
• Neural Machine Translation (NMT): 0.8 step/sec on V100 16GB vs 1.2 step/sec on V100 32GB, 1.5X Faster Language Translation
• 3D FFT 1k x 1k x 1k: 2.5 TF on V100 16GB vs 3.8 TF on V100 32GB, 1.5X Faster Calculations
HIGHER ACCURACY
• R-CNN object detection: 75% Accuracy with VGG-16 (16 layers) on V100 16GB vs 85% Accuracy with RN-152 (152 layers) on V100 32GB, 1.4X Lower Error Rate
HIGHER RESOLUTION
• GAN Image to Image Gen (Unsupervised Image Translation: input winter photo, AI converts it to summer): 512x512 res images on V100 16GB vs 1024x1024 res images on V100 32GB, 4X Higher resolution
Dual E5-2698v4 server, 512GB DDR4, Ubuntu 16.04, CUDA9, cuDNN7 | NMT is GNMT-like and run with TensorFlow NGC Container 18.01 (Batch Size = 128 for 16GB and 256 for 32GB) | FFT is with cufftbench 1k x 1k x 1k and comparing 2 V100 16GB (DGX1V) vs. 2 V100 32GB (DGX1V)
NVIDIA customer R-CNN for object detection at 1080P with Caffe | V100 16GB uses VGG16 | V100 32GB uses Resnet-152
GAN by NVRESEARCH (https://arxiv.org/pdf/1703.00848.pdf) | V100 16GB and V100 32GB with FP32
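The headline multipliers on this slide can be recomputed from the raw numbers it prints. The sketch below assumes the smaller figure in each pair belongs to the 16 GB card and the larger to the 32 GB card; the helper names are illustrative.

```python
def speedup(baseline, improved):
    """Throughput ratio of the improved configuration over the baseline."""
    return improved / baseline

def relative_error_reduction(accuracy_before_pct, accuracy_after_pct):
    """Fraction by which the error rate (100 - accuracy) shrinks."""
    error_before = 100 - accuracy_before_pct
    error_after = 100 - accuracy_after_pct
    return (error_before - error_after) / error_before

def pixel_ratio(side_small, side_large):
    """How many times more pixels a square image of side_large has."""
    return (side_large * side_large) / (side_small * side_small)

if __name__ == "__main__":
    print(f"NMT:  {speedup(0.8, 1.2):.2f}x")        # steps per second
    print(f"FFT:  {speedup(2.5, 3.8):.2f}x")        # TFLOPS
    print(f"Error reduction: {relative_error_reduction(75, 85):.0%}")
    print(f"Pixels: {pixel_ratio(512, 1024):.0f}x")
```

Going from 75% to 85% accuracy cuts the error rate from 25% to 15%, a 40% relative reduction, which matches the "40% Reduced Error Rate" quoted on the preceding slide.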
![Page 36: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/36.jpg)
30% BETTER PERFORMANCE WITH NVLINK THAN PCIE
• Encoder and decoder embedding size of 512
• Batch size of 256 per GPU
• NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2
![Page 37: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/37.jpg)
2.54X BETTER PERFORMANCE WITH NVLINK
• Performance benefits increase with increasing encoder/ decoder embedding size
• Sockeye neural machine translation single-precision training
• NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2
![Page 38: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/38.jpg)
3.1X FASTER ON DGX-1 V100 THAN DGX-1 P100
![Page 39: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/39.jpg)
DGX STATION
![Page 40: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/40.jpg)
INTRODUCING NVIDIA DGX STATION
Groundbreaking AI – at your desk
The Fastest Personal Supercomputer for Researchers and Data Scientists
• Revolutionary form factor: designed for the desk, whisper-quiet
• Start experimenting in hours, not weeks, powered by DGX Stack
• Productivity that goes from desk to data center to cloud
• Breakthrough performance and precision, powered by Volta
![Page 41: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/41.jpg)
INTRODUCING NVIDIA DGX STATION
Groundbreaking AI – at your desk
The Personal AI Supercomputer for Researchers and Data Scientists
Key Features
1. 4 x NVIDIA Tesla V100 GPU (NOW 32 GB)
2. 2nd-gen NVLink (4-way)
3. Water-cooled design
4. 3 x DisplayPort (4K resolution)
5. Intel Xeon E5-2698v4 20-core
6. 256GB DDR4 RAM
![Page 42: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/42.jpg)
NVIDIA DGX STATION SPECIFICATIONS
At a Glance

| Specification | Value |
|---|---|
| GPUs | 4x NVIDIA® Tesla® V100 |
| TFLOPS (GPU FP16) | 500 |
| GPU Memory | 32 GB per GPU |
| NVIDIA Tensor Cores | 2,560 (total) |
| NVIDIA CUDA Cores | 20,480 (total) |
| CPU | Intel Xeon E5-2698 v4 2.2 GHz (20-core) |
| System Memory | 256 GB RDIMM DDR4 |
| Storage | Data: 3 x 1.92 TB SSD RAID 0; OS: 1 x 1.92 TB SSD |
| Network | Dual 10GBASE-T LAN (RJ45) |
| Display | 3x DisplayPort, 4K Resolution |
| Additional Ports | 2x eSATA, 2x USB 3.1, 4x USB 3.0 |
| Acoustics | < 35 dB |
| Maximum Power Requirements | 1500 W |
| Operating Temperature Range | 10 - 30 °C |
| Software | Ubuntu Desktop Linux OS; DGX Recommended GPU Driver; CUDA Toolkit |
![Page 43: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/43.jpg)
NVIDIA DGX-2
THE WORLD’S MOST POWERFUL DEEP LEARNING SYSTEM FOR THE MOST COMPLEX DEEP LEARNING CHALLENGES
• First 2 PFLOPS System
• 16 V100 32GB GPUs Fully Interconnected
• NVSwitch: 2.4 TB/s bisection bandwidth
• 24X GPU-GPU Bandwidth
• 0.5 TB of Unified GPU Memory
• 10X Deep Learning Performance
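The system-level figures quoted for the DGX family are the per-GPU Tesla V100 figures multiplied by the GPU count. A short sketch that derives them from numbers stated in the deck (the dictionary and function names are illustrative):

```python
# Per-GPU figures for Tesla V100 32GB as quoted in the deck.
V100 = {"dl_tflops": 125, "memory_gb": 32, "cuda_cores": 5120, "tensor_cores": 640}
# GPU counts per system as quoted in the deck.
GPUS_PER_SYSTEM = {"DGX Station": 4, "DGX-1": 8, "DGX-2": 16}

def system_totals(system):
    """Scale every per-GPU figure by the number of GPUs in the system."""
    n = GPUS_PER_SYSTEM[system]
    return {key: value * n for key, value in V100.items()}

if __name__ == "__main__":
    for name in GPUS_PER_SYSTEM:
        print(name, system_totals(name))
```

This gives 500 TFLOPS for DGX Station, 1,000 TFLOPS (1 PetaFLOPS) for DGX-1, and 2,000 TFLOPS (2 PFLOPS) with 512 GB (0.5 TB) of GPU memory for DGX-2.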
![Page 44: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/44.jpg)
DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE
[Exploded view of DGX-2 with numbered callouts]
1. NVIDIA Tesla V100 32GB
2. Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card
3. Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
4. Eight EDR Infiniband/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth
5. PCIe Switch Complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB System Memory
8. 30 TB NVME SSDs Internal Storage
9. Dual 10/25 Gb/sec Ethernet
![Page 45: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/45.jpg)
NVSWITCH
WORLD’S HIGHEST BANDWIDTH ON-NODE SWITCH
7.2 Terabits/sec or 900 GB/sec
18 NVLINK ports | 50GB/s per port bi-directional
Fully-connected crossbar
2 billion transistors | 47.5mm x 47.5mm package
![Page 46: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/46.jpg)
NVSWITCH ENABLES THE WORLD’S LARGEST GPU
16 Tesla V100 32GB Connected by New NVSwitch
2 petaFLOPS of DL Compute
Unified 512GB HBM2 GPU Memory Space
300GB/sec Every GPU-to-GPU
2.4TB/sec of Total Cross-section Bandwidth
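The NVSwitch and DGX-2 bandwidth figures are related by simple multiplication. The sketch below uses only numbers quoted on the two NVSwitch slides. It assumes the bisection figure counts the full 300 GB/s of each of the 8 GPUs on one side of the cut; that reading reproduces the quoted 2.4 TB/sec but is not spelled out in the deck.

```python
NVLINK_PORT_GB_S = 50     # bidirectional GB/s per NVLink port (NVSwitch slide)
NVSWITCH_PORTS = 18       # ports per NVSwitch
GPU_NVLINK_GB_S = 300     # GB/s between every pair of GPUs (DGX-2 slide)
GPU_COUNT = 16

def gbytes_to_tbits(gb_per_s):
    """Convert gigabytes per second to terabits per second (decimal units)."""
    return gb_per_s * 8 / 1000

switch_gb_s = NVSWITCH_PORTS * NVLINK_PORT_GB_S        # per-switch aggregate
switch_tbit_s = gbytes_to_tbits(switch_gb_s)
links_per_gpu = GPU_NVLINK_GB_S // NVLINK_PORT_GB_S    # NVLink ports per V100
# Assumed reading: cut the 16 GPUs into two halves of 8; each GPU on one
# side can drive its full NVLink bandwidth across the cut.
bisection_tb_s = (GPU_COUNT // 2) * GPU_NVLINK_GB_S / 1000

if __name__ == "__main__":
    print(f"NVSwitch: {switch_gb_s} GB/s = {switch_tbit_s:.1f} Tb/s")
    print(f"Links per GPU: {links_per_gpu}")
    print(f"Bisection: {bisection_tb_s:.1f} TB/s")
```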
![Page 47: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/47.jpg)
Software system
![Page 48: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/48.jpg)
Virtual Machine vs. Container
Not so similar
Docker VS VM
![Page 49: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/49.jpg)
COMMON SOFTWARE STACK ACROSS DGX FAMILY
• Single, unified stack for deep learning frameworks
• Predictable execution across platforms
• Pervasive reach
Platforms: DGX Station | DGX-1 | DGX-2 | Cloud Service Provider | NVIDIA GPU Cloud
![Page 50: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/50.jpg)
NGC GPU-OPTIMIZED DEEP LEARNING CONTAINERS
A Comprehensive Catalog of Deep Learning Software
• NVCaffe
• Caffe2
• Microsoft Cognitive Toolkit (CNTK)
• DIGITS
• MXNet
• PyTorch
• TensorFlow
• Theano
• Torch
• CUDA (base level container for developers)
• NEW! NVIDIA TensorRT inference accelerator with ONNX support
![Page 51: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/51.jpg)
VIRTUAL WORKSTATIONS AND PCs
![Page 52: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/52.jpg)
Overview of Common GPU Solutions
[Diagram comparing four approaches]
• Software-virtualized GPU: VMs, each with its own driver, run on a hypervisor
• GPU passthrough: each VM, with its own driver, is given a dedicated physical GPU through the hypervisor
• GPU sharing: multiple user sessions share one GPU and one driver on Windows Server + XenApp
• vGPU: each VM runs its own driver on a vGPU; the vGPUs are carved from a physical GPU under the hypervisor
![Page 53: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/53.jpg)
GPU Virtualization (vGPU): One-to-Many Virtual Partitioning of a GPU
[Diagram: the server's physical GPUs (labeled GPU 1 and GPU 2) are partitioned into 16 virtual GPUs (vGPU); each of the 16 VMs runs its own NVIDIA Driver on one vGPU]
![Page 54: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/54.jpg)
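How is a vGPU partitioned on a virtualization platform?
• One-quarter partitioning
• One-eighth partitioning
vGPU partitioning characteristics: a single GPU does not support mixing partition types, and vGPU resources are released when the virtual machine shuts down.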
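The two partitioning rules on this slide (one slice size per physical GPU, and resources freed when a VM shuts down) can be expressed as a small allocator. This is a toy model for illustration only: the class and method names are hypothetical and not part of any NVIDIA API, and the 16 GB frame buffer is an assumed example size.

```python
class PhysicalGPU:
    """Toy model of vGPU slicing on one physical GPU (hypothetical, not an NVIDIA API)."""

    def __init__(self, memory_gb):
        self.memory_gb = memory_gb
        self.profile_gb = None   # slice size currently in use, None when idle
        self.vms = set()

    def capacity(self, profile_gb):
        """How many vGPUs of this size fit in the frame buffer."""
        return int(self.memory_gb // profile_gb)

    def try_start(self, vm, profile_gb):
        """Start a VM on a vGPU slice; return False if the rules forbid it."""
        if self.profile_gb is not None and profile_gb != self.profile_gb:
            return False         # one GPU cannot mix slice sizes
        if len(self.vms) >= self.capacity(profile_gb):
            return False         # frame buffer fully allocated
        self.profile_gb = profile_gb
        self.vms.add(vm)
        return True

    def stop(self, vm):
        """Shut a VM down; its vGPU resources are released."""
        self.vms.discard(vm)
        if not self.vms:
            self.profile_gb = None   # idle GPU can take a new slice size

if __name__ == "__main__":
    gpu = PhysicalGPU(16)                    # assumed 16 GB card
    print(gpu.capacity(4), "quarter slices,", gpu.capacity(2), "eighth slices")
```

On the assumed 16 GB card, one-quarter partitioning corresponds to four 4 GB vGPUs and one-eighth partitioning to eight 2 GB vGPUs. Switching between the two requires all VMs on that GPU to be stopped first.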
![Page 55: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/55.jpg)
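History of NVIDIA GPU Virtualization Solutions
Adding Value Through Software Updates
Features / Functions by release:
• GRID 1.x (2013.12): Kepler architecture; K1, K2 GPUs; 8:1 vGPU; OGL/DX; XenServer 6.2 SP1
• GRID 2.x (2015.8): Maxwell architecture; M6, M60 GPUs; 16:1 vGPU; OGL/DX; VMware vSphere 6.0; Huawei UVP; licensing model (vPC, vWS, vWS ext); Windows 10 support
• GRID 3.x (2016.4): Maxwell architecture; 4K display support; vApp licensing model introduced (vApp, vPC, vWS); DX12 support
• GRID 4.x (2016.8): Maxwell architecture; M10 GPU; host-level and VM-level resource monitoring; Citrix Desktop Director support; Windows Server 2016 VM support; vSphere 6.5 support; XenServer 7.1/2 support; RHEL KVM GPU passthrough support
• Virtual GPU 5.x (2017.12): Pascal architecture; renamed Virtual GPU; P4, P6, P40, P100; CUDA/OGL/DX; 2x performance improvement; 24:1 vGPU; Nutanix KVM; Linux hardware encoding support; application-level monitoring; licensing model (vApp, vPC, vDWS); VMware vROps support; two new GPU scheduling modes; License HA deployment mode; support for more than 1 TB of memory
• Virtual GPU 6.x (2018.10): Volta architecture; V100 16/32GB PCIE/SXM2 GPUs; 32:1 vGPU; new support for RHEL 7.5/RHV 4.2 KVM, Sangfor VMP, and H3C CAS KVM; vGPU Motion; vPC supports 2 GB of frame buffer; vPC supports Linux OS; vPC supports two 4K displays or four HD displays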
![Page 56: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/56.jpg)
The NVIDIA Virtual GPU 7.1 platform provides graphics, compute, and AI capabilities
• Support, updates, and maintenance
• NVIDIA Virtual GPU software
• NVIDIA Tesla (data center GPUs)
![Page 57: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/57.jpg)
VIRTUAL GPU: vGPU 7.X New Features
Unprecedented Performance & Manageability
• Multi-vGPU Support: World’s Most Powerful (Quadro vDWS)
• vMotion Support for vGPU: Live Migration of vGPU-enabled VMs (Quadro vDWS & GRID)
• Tesla T4 GPU Support*: Latest Generation Turing (Quadro vDWS)
• NGC with vGPU: Available with vGPU (Quadro vDWS)
* Tesla T4 support coming with vGPU software 7.1 release
![Page 58: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/58.jpg)
A Wider Choice of GPUs
Supports the full Tesla P, V, and T product lines for different user scenarios
Categories: PERFORMANCE Optimized | DENSITY Optimized | BLADE Optimized

| | V100 | P100 | P40 | P4 | T4 | M60 | M10 | M6 | P6 |
|---|---|---|---|---|---|---|---|---|---|
| GPUs / Board (Architecture) | 1 (Volta) | 1 (Pascal) | 1 (Pascal) | 1 (Pascal) | 1 (Turing) | 2 (Maxwell) | 4 (Maxwell) | 1 (Maxwell) | 1 (Pascal) |
| CUDA Cores | 5,120 | 3,584 | 3,840 | 2,560 | 2,560 | 4,096 (2,048 per GPU) | 2,560 (640 per GPU) | 1,536 | 2,048 |
| Memory Size | 32 GB/16 GB HBM2 | 16 GB HBM2 | 24 GB GDDR5 | 8 GB GDDR5 | 16 GB GDDR6 | 16 GB GDDR5 (8 GB per GPU) | 32 GB GDDR5 (8 GB per GPU) | 8 GB GDDR5 | 16 GB GDDR5 |
| vGPU Profiles | 1 GB, 2 GB, 4 GB, 8 GB, 16 GB, 32 GB | 1 GB, 2 GB, 4 GB, 8 GB, 16 GB | 1 GB, 2 GB, 3 GB, 4 GB, 6 GB, 8 GB, 12 GB, 24 GB | 1 GB, 2 GB, 4 GB, 8 GB | 1 GB, 2 GB, 4 GB, 8 GB, 16 GB | 0.5 GB, 1 GB, 2 GB, 4 GB, 8 GB | 0.5 GB, 1 GB, 2 GB, 4 GB, 8 GB | 0.5 GB, 1 GB, 2 GB, 4 GB, 8 GB | 1 GB, 2 GB, 4 GB, 8 GB, 16 GB |
| Form Factor | PCIe 3.0 Dual Slot & SXM2 (rack servers) | PCIe 3.0 Dual Slot (rack servers) | PCIe 3.0 Dual Slot (rack servers) | PCIe 3.0 Single Slot (rack servers) | PCIe 3.0 Single Slot (rack servers) | PCIe 3.0 Dual Slot (rack servers) | PCIe 3.0 Dual Slot (rack servers) | MXM (blade servers) | MXM (blade servers) |
| Power | 250W/300W | 250W | 250W | 75W | 70W | 300W (225W opt) | 225W | 100W (75W opt) | 90W |
| Thermal | passive | passive | passive | passive | passive | active/passive | passive | bare board | bare board |
![Page 59: HPC AND AI ACCELERATION ON GPU · 2019-03-13 · 3 1980 1990 2000 2010 2020 GPU-Computing perf 1.5X per year 1000X by 2025 ... 2008 World’s First GPU Top500 System 2006 CUDA Launched](https://reader030.vdocuments.us/reader030/viewer/2022040210/5e4dea154781d06aeb0467d1/html5/thumbnails/59.jpg)
THANKS