![Page 1: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/1.jpg)
Siddharth Sharma, Oct 2016
WHAT’S NEW IN CUDA 8
![Page 2: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/2.jpg)
2
WHAT’S NEW IN CUDA 8
Why Should You Care
>2X
Run Computations Faster*
Solve Larger Problems**
Critical Path Analysis
* HOOMD Blue v1.3.3 Lennard-Jones liquid benchmark
• K80 and P100 (PCIe); Base clocks; 4 GPUs per PCIe root complex
• 2x K80 indicates 2-GPU configuration (or 1x K80 board)
• CUDA 8 GA with r361.79 (K80) and r361.93.02 (P100)
• Host System: Intel Xeon Broadwell dual socket 22-core E5-2699
[email protected] 3.6GHz Turbo with CentOS 7.2 x86-64 and 256GB memory
** HPGMG-FV benchmark
• K80 and P100 (PCIe); Base clocks; 4 GPUs per PCIe root complex
• 2x K80 indicates 2-GPU configuration (or 1x K80 board)
• CUDA 8 GA with r361.79 (K80) and r361.93.02 (P100)
• Host System: Intel Xeon Broadwell dual socket 22-core E5-2699 [email protected] 3.6GHz
Turbo with CentOS 7.2 x86-64 and 256GB memory
![Page 3: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/3.jpg)
3
WHAT’S NEW IN CUDA 8
LIBRARIES
• New nvGRAPH library
• Support for FP16, INT8
UNIFIED MEMORY
• Demand Paging
• New Tuning APIs
• Data Coherence & Atomics
PASCAL ARCHITECTURE
• NVLINK
• HBM2 Stacked Memory
• Page Migration Engine
DEVELOPER TOOLS
• Critical Path Analysis
• NVCC Compile Time
• OpenACC Profiling
![Page 4: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/4.jpg)
4
PASCAL ARCHITECTURE
![Page 5: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/5.jpg)
5
PASCAL ARCHITECTURE
Webinar: “Inside Pascal”
Mark Harris (NVIDIA), Lars Nyland (NVIDIA)
GPU Technical Conference 2016 - ID S6176
![Page 6: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/6.jpg)
6
CUDA 8 ON P100: >3X FASTER THAN CPUs
[Chart: Application Throughput Speedup vs. CPU]
Performance may vary based on OS and software
versions, and motherboard configuration
• K80 and P100 (PCIe); Base clocks; 4 GPUs per PCIe root complex
• CUDA 8 GA with r361.79 (K80) and r361.93.02 (P100)
• Host System: Intel Xeon Broadwell dual socket 22-core E5-2699 [email protected] 3.6GHz Turbo
with CentOS 7.2 x86-64 and 256GB memory
[Chart axis: speedup from 0x to 35x. Benchmarks: LAMMPS, NAMD, HPCG (128), HPCG (256), HOOMD Blue, QUDA (Average), MiniFe. Series: E5-2699v4, 2x K80, 1x P100, 2x P100, 4x P100, 8x P100.]
![Page 7: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/7.jpg)
7
UNIFIED MEMORY
![Page 8: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/8.jpg)
8
UNIFIED MEMORY
Past Developer View
Implicit Memory Management
Starting with Kepler and CUDA 6
[Diagram: CPU and Pascal GPU with separate System Memory and GPU Memory, next to CPU and Pascal GPU sharing Unified Memory]
![Page 9: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/9.jpg)
9
APPLICATIONS: LARGE VARIATIONS IN DATASET SIZES
Ray-tracing: Larger scenes to render
Graph Analysis: Larger datasets
Combustion: More species & improved accuracy
Quantum Chemistry: Larger systems
![Page 10: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/10.jpg)
10
CUDA 8: PASCAL UNIFIED MEMORY
Easier Memory Management, APIs for High Performance
ENABLE LARGE DATA MODELS
• Oversubscribe GPU memory
• Allocate up to system memory size
TUNE UNIFIED MEMORY PERFORMANCE
• APIs for Pre-fetching & Read duplication
• Usage hints via cudaMemAdvise API
SIMPLER DATA ACCESS
• CPU/GPU Data coherence
• Unified memory atomic operations
[Diagram: CUDA 8 Unified Memory spanning the CPU and Pascal GPU; allocate beyond GPU memory size]
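The tuning APIs named on this slide can be sketched as follows. This is a minimal illustration, not code from the webinar: the kernel `scale`, the helper `run_prefetch_demo`, and the array size are hypothetical, and the `cudaMemAdvise` / `cudaMemPrefetchAsync` calls use the signatures introduced with CUDA 8 (later toolkits may change them). The hints are issued only when the device reports `cudaDevAttrConcurrentManagedAccess`, which is assumed here to indicate Pascal-style on-demand migration.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: out[i] = 2 * in[i].
__global__ void scale(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Returns 0 on success, 1 on a CUDA error or a wrong result.
int run_prefetch_demo() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    int device = 0, concurrent = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 1;
    // Assumption: this attribute is nonzero on devices with on-demand migration.
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

    float *in = NULL, *out = NULL;
    if (cudaMallocManaged(&in, bytes) != cudaSuccess) return 1;
    if (cudaMallocManaged(&out, bytes) != cudaSuccess) { cudaFree(in); return 1; }
    for (size_t i = 0; i < n; ++i) in[i] = 1.0f;  // pages are first touched on the CPU

    if (concurrent) {
        // Read duplication: mark the input as read-mostly so copies may be replicated.
        cudaMemAdvise(in, bytes, cudaMemAdviseSetReadMostly, device);
        // Move both buffers to the GPU ahead of the kernel instead of faulting page by page.
        cudaMemPrefetchAsync(in, bytes, device, 0);
        cudaMemPrefetchAsync(out, bytes, device, 0);
    }

    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    scale<<<blocks, threads>>>(in, out, n);

    // Bring the result back to host memory before the CPU reads it.
    if (concurrent) cudaMemPrefetchAsync(out, bytes, cudaCpuDeviceId, 0);
    int ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (size_t i = 0; ok && i < n; ++i) ok = (out[i] == 2.0f);
    cudaFree(in);
    cudaFree(out);
    return ok ? 0 : 1;
}
```

Calling `run_prefetch_demo()` from `main` and checking for a zero return is enough to exercise it. The hints are advisory, so the program should produce the same result with or without them; only the migration traffic is expected to differ.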
![Page 11: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/11.jpg)
11
CUDA 8 UNIFIED MEMORY — EXAMPLE
64 GB unified memory allocation on P100 with 16 GB physical memory
Transparent – No API changes
Works on Pascal & future architectures
Allocating 4x more than P100 physical memory
void foo() {
    // Allocate 64 GB
    char *data;
    // 64ULL forces 64-bit arithmetic; 64*1024*1024*1024 would overflow int
    size_t size = 64ULL * 1024 * 1024 * 1024;
    cudaMallocManaged(&data, size);
}
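A kernel can then touch such an allocation directly. The sketch below is an assumption-laden illustration rather than the webinar's code: it sizes the buffer at 1.5x the device's physical memory (an arbitrary factor), writes only 1024 evenly spaced bytes so that few pages are actually populated, and uses `cudaDevAttrConcurrentManagedAccess` as a proxy for oversubscription support. The names `touch` and `run_oversubscribe_demo` are hypothetical.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: writes one marker byte at each of `samples` evenly spaced offsets.
__global__ void touch(char *data, size_t stride, size_t samples) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < samples) data[i * stride] = 'g';
}

// Returns 0 on success, 1 on error, 2 if the device is assumed unable to oversubscribe.
int run_oversubscribe_demo() {
    int device = 0, concurrent = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 1;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    if (!concurrent) return 2;  // proxy check; not an official oversubscription query

    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) return 1;

    // Ask for more managed memory than the GPU physically has.
    const size_t size = total_bytes + total_bytes / 2;
    const size_t samples = 1024;
    const size_t stride = size / samples;  // last write lands at (samples - 1) * stride < size

    char *data = NULL;
    if (cudaMallocManaged(&data, size) != cudaSuccess) return 1;

    touch<<<(unsigned)((samples + 255) / 256), 256>>>(data, stride, samples);
    int ok = (cudaDeviceSynchronize() == cudaSuccess);

    // The CPU reads the same pointer; pages migrate back on demand.
    for (size_t i = 0; ok && i < samples; ++i) ok = (data[i * stride] == 'g');
    cudaFree(data);
    return ok ? 0 : 1;
}
```

Touching the whole buffer would also work on a supported system, but it needs enough host memory to back the pages that are evicted from the GPU, which is why the sketch samples sparsely.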
![Page 12: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/12.jpg)
12
CUDA 8 UNIFIED MEMORY — EXAMPLE
Both CPU code and CUDA kernel accessing ‘data’ simultaneously
Possible with CUDA 8 unified memory on Pascal
Accessing data simultaneously by CPU and GPU codes
__global__ void mykernel(char *data) {
    data[1] = 'g';
}

void foo() {
    char *data;
    cudaMallocManaged(&data, 2);

    mykernel<<<...>>>(data);
    // no synchronize here
    data[0] = 'c';

    cudaFree(data);
}
Webinar: “CUDA 8 Unified Memory”
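Because simultaneous access is described as possible only with CUDA 8 on Pascal, portable code may want to check for it first. The following is a sketch under assumptions: the `<<<1, 1>>>` launch configuration and the helper name `run_concurrent_demo` are illustrative choices, and `cudaDevAttrConcurrentManagedAccess` is used as the capability check, with a synchronizing fallback for devices that do not report it.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

__global__ void mykernel(char *data) {
    data[1] = 'g';
}

// Returns 0 if both writes are visible afterwards, 1 otherwise.
int run_concurrent_demo() {
    int device = 0, concurrent = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 1;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

    char *data = NULL;
    if (cudaMallocManaged(&data, 2) != cudaSuccess) return 1;

    mykernel<<<1, 1>>>(data);  // launch configuration chosen for this sketch

    // Without concurrent managed access, the CPU must wait for the kernel
    // before it touches managed memory.
    if (!concurrent) cudaDeviceSynchronize();
    data[0] = 'c';  // on supporting devices this may overlap with the kernel

    // Wait for the kernel before reading the byte it wrote.
    int ok = (cudaDeviceSynchronize() == cudaSuccess) &&
             data[0] == 'c' && data[1] == 'g';
    cudaFree(data);
    return ok ? 0 : 1;
}
```

The CPU and GPU write different bytes here, so there is no data race on a single location; coordinating writes to the same location would need the atomic operations mentioned earlier in the deck.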
![Page 13: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/13.jpg)
13
>3X SPEEDUP WITH UNIFIED MEMORY
[Chart: HPGMG AMR Application Throughput (Million DOF/s, axis 0 to 180) versus Working Set (GB): 1.4, 4.7, 8.6, 28.9, 58.6. Series: CPU only, K40, P100 (Unified Memory), P100 (Unified Memory with hints). Annotations: All 5 levels fit in GPU memory; Only 2 levels fit; Only 1 level fits; P100 GPU Memory Size (16GB).]
Performance may vary based on OS and software
versions, and motherboard configuration
• HPGMG AMR on 1xK40, 1xP100 (PCIe) with CUDA 8 (r361)
• CPU measurements with Intel Xeon Haswell dual socket 10-core E5-2650 [email protected] GHz 3.0 GHz Turbo, HT on
• Host System: Intel Xeon Haswell dual socket 16-cores E5-2630 [email protected] 3.2GHz Turbo
![Page 14: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/14.jpg)
14
LIBRARIES
![Page 15: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/15.jpg)
15
GRAPH ANALYTICS
… and much more: Parallel Computing, Recommender Systems, Fraud Detection, Voice Recognition, Text Understanding, Search
Insight from connections in big data
GENOMICS
CYBER SECURITY / NETWORK ANALYTICS
SOCIAL NETWORK ANALYSIS
Image credits: Wikimedia Commons, Circos.ca
![Page 16: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/16.jpg)
16
https://developer.nvidia.com/nvGRAPH
nvGRAPH
Parallel Library for Interactive and High Throughput Graph Analytics
Solve graphs with up to 2.5 Billion edges on a single GPU (Tesla M40)
Includes — PageRank, Single Source Shortest Path and Single Source Widest Path algorithms
Semi-ring SPMV operations provide building blocks for graph traversal algorithms
GPU Accelerated Graph Analytics
PageRank: Search, Recommendation Engines, Social Ad Placement
Single Source Shortest Path: Robotic Path Planning, Power Network Planning, Logistics & Supply Chain Planning
Single Source Widest Path: IP Routing, Chip Design / EDA, Traffic sensitive routing
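To make the semi-ring SPMV idea concrete, here is a hand-written sketch that does not use the nvGRAPH API. It replaces the usual (+, x) of sparse matrix-vector multiply with (min, +), so that repeated multiplication relaxes shortest-path distances in Bellman-Ford style. The kernel `minplus_spmv`, the helper `run_minplus_demo`, and the four-vertex graph are all hypothetical.

```cuda
#include <cmath>
#include <cstddef>
#include <cuda_runtime.h>

// One (min, +) SpMV step over a CSR matrix whose row v lists the incoming edges
// of vertex v: y[v] = min(x[v], min over edges (u -> v) of weight + x[u]).
__global__ void minplus_spmv(const int *row_ptr, const int *col_idx,
                             const float *val, const float *x, float *y, int n) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    float best = x[v];
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
        best = fminf(best, val[e] + x[col_idx[e]]);
    y[v] = best;
}

// Returns 0 if the computed distances match the expected ones, 1 otherwise.
int run_minplus_demo() {
    const int n = 4, nnz = 4;
    // Toy graph (made-up data): 0->1 (1), 0->2 (4), 1->2 (2), 2->3 (1).
    const int h_row_ptr[n + 1] = {0, 0, 1, 3, 4};
    const int h_col_idx[nnz] = {0, 0, 1, 2};
    const float h_val[nnz] = {1.0f, 4.0f, 2.0f, 1.0f};
    const float expected[n] = {0.0f, 1.0f, 3.0f, 4.0f};  // distances from vertex 0

    int *row_ptr = NULL, *col_idx = NULL;
    float *val = NULL, *x = NULL, *y = NULL;
    bool ok = cudaMallocManaged(&row_ptr, (n + 1) * sizeof(int)) == cudaSuccess &&
              cudaMallocManaged(&col_idx, nnz * sizeof(int)) == cudaSuccess &&
              cudaMallocManaged(&val, nnz * sizeof(float)) == cudaSuccess &&
              cudaMallocManaged(&x, n * sizeof(float)) == cudaSuccess &&
              cudaMallocManaged(&y, n * sizeof(float)) == cudaSuccess;
    if (ok) {
        for (int i = 0; i <= n; ++i) row_ptr[i] = h_row_ptr[i];
        for (int e = 0; e < nnz; ++e) { col_idx[e] = h_col_idx[e]; val[e] = h_val[e]; }
        for (int v = 0; v < n; ++v) x[v] = (v == 0) ? 0.0f : INFINITY;
    }
    // n - 1 relaxation steps are enough for shortest paths on n vertices.
    for (int it = 0; ok && it < n - 1; ++it) {
        minplus_spmv<<<1, 32>>>(row_ptr, col_idx, val, x, y, n);
        ok = (cudaDeviceSynchronize() == cudaSuccess);
        float *tmp = x; x = y; y = tmp;  // the newest distances are now in x
    }
    for (int v = 0; ok && v < n; ++v) ok = (x[v] == expected[v]);

    cudaFree(row_ptr); cudaFree(col_idx); cudaFree(val); cudaFree(x); cudaFree(y);
    return ok ? 0 : 1;
}
```

Swapping the pair of operations changes the algorithm without changing the traversal pattern: (max, min), for instance, would give a widest-path relaxation over the same CSR structure.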
![Page 17: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/17.jpg)
17
> 200X SPEEDUP ON PAGERANK VS GALOIS
Performance may vary based on OS and software
versions, and motherboard configuration
• nvGRAPH on M40 (ECC ON, r352), P100 (r361), Base clocks, input and output data on device
• GraphMat, Galois (v2.3) on Intel Xeon Broadwell dual-socket 22-core/socket E5-2699 v4 @ 2.22GHz, 3.6GHz Turbo
• Comparing Average Time per Iteration (ms) for PageRank
• Host System: Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo
• CentOS 7.2 x86-64 with 128GB System Memory
[Chart: PageRank speedup vs. Galois (axis 0x to 250x) on the soc-LiveJournal and Twitter datasets. Series: Galois (1x baseline), GraphMat, M40, P100.]
![Page 18: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/18.jpg)
18
HIGHER THROUGHPUT THROUGH LOWER PRECISION COMPUTATION
Deep Learning: cuBLAS FP16 and INT8 GEMMs
Radio Astronomy: cuFFT native FP16 operations
Fluid Dynamics: cuSPARSE FP16 CSRMV
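The library entry points themselves are not shown in the slides. As a rough illustration of what FP16 data looks like at the kernel level, the sketch below stores vectors as `__half` (half the memory traffic of `float`) and does the arithmetic in FP32 through the conversion intrinsics from `cuda_fp16.h`. It is a hand-written example, not cuBLAS, cuFFT, or cuSPARSE code; the kernels and `run_fp16_demo` are hypothetical names, and native FP16 arithmetic is deliberately avoided because it needs newer GPU architectures.

```cuda
#include <cmath>
#include <cstddef>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

__global__ void to_half(const float *in, __half *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __float2half(in[i]);
}

__global__ void to_float(const __half *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __half2float(in[i]);
}

// y = a * x + y with FP16 storage and FP32 arithmetic.
__global__ void axpy_half(float a, const __half *x, __half *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = __float2half(a * __half2float(x[i]) + __half2float(y[i]));
}

// Returns 0 if every result is within tolerance, 1 otherwise.
int run_fp16_demo() {
    const int n = 1024, threads = 256, blocks = (n + threads - 1) / threads;
    float *fx = NULL, *fy = NULL;
    __half *hx = NULL, *hy = NULL;
    bool ok = cudaMallocManaged(&fx, n * sizeof(float)) == cudaSuccess &&
              cudaMallocManaged(&fy, n * sizeof(float)) == cudaSuccess &&
              cudaMallocManaged(&hx, n * sizeof(__half)) == cudaSuccess &&
              cudaMallocManaged(&hy, n * sizeof(__half)) == cudaSuccess;
    if (ok) {
        // Inputs are chosen to be exactly representable in FP16.
        for (int i = 0; i < n; ++i) { fx[i] = 0.25f * (i % 8); fy[i] = 1.0f; }
        to_half<<<blocks, threads>>>(fx, hx, n);
        to_half<<<blocks, threads>>>(fy, hy, n);
        axpy_half<<<blocks, threads>>>(2.0f, hx, hy, n);
        to_float<<<blocks, threads>>>(hy, fy, n);
        ok = (cudaDeviceSynchronize() == cudaSuccess);
    }
    for (int i = 0; ok && i < n; ++i)
        ok = (fabsf(fy[i] - (2.0f * fx[i] + 1.0f)) < 1e-3f);

    cudaFree(fx); cudaFree(fy); cudaFree(hx); cudaFree(hy);
    return ok ? 0 : 1;
}
```

FP16 has roughly three decimal digits of precision, so real workloads usually need a looser tolerance than this exactly representable test data suggests.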
![Page 19: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/19.jpg)
19
DEVELOPER TOOLS
![Page 20: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/20.jpg)
20
DEPENDENCY ANALYSIS
The longest running kernel is not always the most critical optimization target
Easily find the critical kernel to optimize
[Timeline diagram: the CPU row shows A, wait, B, wait; the GPU row shows Kernel X (5%) and Kernel Y (40%); one kernel is marked "Optimize Here".]
![Page 21: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/21.jpg)
21
IDENTIFY BOTTLENECKS ON CRITICAL PATH
Visual Profiler and NVPROF
Unguided Analysis
Dependency Analysis
Functions on critical path
![Page 22: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/22.jpg)
22
IDENTIFY BOTTLENECKS ON CRITICAL PATH
Visual Profiler and NVPROF
Inbound dependencies
Launch copy_kernel MemCpy HtoD [sync]
Outbound dependencies
MemCpy DtoH [sync]
![Page 23: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/23.jpg)
23
OpenACC PROFILING
• OpenACC->Driver API->Compute correlation
• OpenACC->Source Code correlation
• OpenACC timeline
• OpenACC Properties
![Page 24: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/24.jpg)
24
PROFILE CPU CODE + GPU CODE IN VISUAL PROFILER
Profile execution times on host function calls
View CPU code function hierarchy
![Page 25: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/25.jpg)
25
PROFILE UNIFIED MEMORY
Webinar: “CUDA 8 Tools Webinar”
MONITOR NVLINK BANDWIDTH
![Page 26: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/26.jpg)
26
COMPILE NVCC 2X FASTER
Improved Developer Productivity
Performance may vary based on OS and software
versions, and motherboard configuration
• Average total compile times (per translation unit)
• Host system: Intel Core i7-3930K 6-cores @ 3.2GHz
• CentOS x86_64 Linux release 7.1.1503 (Core) with GCC 4.8.3 20140911
• GPU target architecture sm_52
Webinar: “CUDA 8 Performance Report”
![Page 27: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/27.jpg)
27
NEW PLATFORMS SUPPORTED
| Platform | Operating Systems | Compilers |
| --- | --- | --- |
| Windows | Windows Server 2012 R2 | Microsoft Visual Studio 2015 Update 3 and VS Community 2015 |
| Linux | Fedora 23, Ubuntu 16.04, SLES 1 | PGI C++ 16.1/16.4, Clang 3.7, ICC 16.0 |
| MAC | OS X 10.12 | GCC 5.x |
![Page 28: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/28.jpg)
28
WHAT’S NEW IN CUDA 8
LIBRARIES
• New nvGRAPH library
• Support for FP16, INT8
UNIFIED MEMORY
• Demand Paging
• New Tuning APIs
• Data Coherence & Atomics
PASCAL ARCHITECTURE
• NVLINK
• HBM2 Stacked Memory
• Page Migration Engine
DEVELOPER TOOLS
• Critical Path Analysis
• NVCC Compile Time
• OpenACC Profiling
![Page 29: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/29.jpg)
29
CUDA 8 – DOWNLOAD TODAY!
• CUDA Applications in your Industry: www.nvidia.com/object/gpu-applications-domain.htm
• Additional Webinars:
• Inside PASCAL
• CUDA 8 Performance Report
• CUDA 8 Tools
• CUDA 8 Unified Memory
• CUDA 8 Release Notes: www.docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#abstract
Everything You Need to Accelerate Applications
developer.nvidia.com/cuda-toolkit
![Page 30: WHAT’S NEW IN CUDA 8 - NVIDIAon-demand.gputechconf.com/gtc/2016/webinar/whats-new-cuda-toolkit-8.pdfWHAT’S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve](https://reader034.vdocuments.us/reader034/viewer/2022042221/5ec73c4149d3a10a1b76e28d/html5/thumbnails/30.jpg)
THANK YOU