![Page 1: THE NVIDIA DEEP LEARNING ACCELERATOR · 2018-08-19 · Encourage Deep Learning applications Invite contributions from the community ... ©2018 NVIDIA CORPORATION ©2018 NVIDIA CORPORATION](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec98cad677e3c7a135932be/html5/thumbnails/1.jpg)
THE NVIDIA DEEP LEARNING ACCELERATOR
©2018 NVIDIA CORPORATION
INTRODUCTION
NVDLA: NVIDIA Deep Learning Accelerator
Developed as part of Xavier, NVIDIA's SoC for autonomous driving applications
Optimized for Convolutional Neural Networks (CNNs) and computer vision
Open source architecture and RTL release
Encourage Deep Learning applications
Invite contributions from the community
Targeted towards edge devices and IoT
Uses industry-standard formats and is parameterized
CNN INFERENCE
Convolutional Neural Network
[Figure: a CNN with layers conv1, conv2, full1, and full2 classifying an input image as "cat".]
Convolutional and fully connected layers
CNN INFERENCE
One Convolutional Layer
[Figure: input activations (C x H x W) and filter weights (K filters, each C x R x S) feed the convolution; the convolution result (K x H x W) passes through post-processing to give the output activations (K x H' x W').]
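The dimension names on this slide (C input channels, K filters, an R x S kernel) can be made concrete with a plain loop nest. The following is only a reference sketch, not NVDLA code: it assumes stride 1, no padding, and that R is the kernel height and S the kernel width. The function name `conv_layer` is hypothetical.

```python
def conv_layer(inp, weights):
    """Direct convolution: inp[c][y][x], weights[k][c][r][s] -> out[k][y][x].

    Assumptions (not from the slides): stride 1, no padding,
    R = kernel height, S = kernel width.
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K, R, S = len(weights), len(weights[0][0]), len(weights[0][0][0])
    out_h, out_w = H - R + 1, W - S + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(K)]
    for k in range(K):                      # one output channel per filter
        for y in range(out_h):
            for x in range(out_w):
                acc = 0
                for c in range(C):          # sum over input channels
                    for r in range(R):      # ... and over the kernel window
                        for s in range(S):
                            acc += inp[c][y + r][x + s] * weights[k][c][r][s]
                out[k][y][x] = acc
    return out


if __name__ == "__main__":
    inp = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]            # C=1, H=W=3
    weights = [[[[1, 0], [0, 1]]], [[[1, 1], [1, 1]]]]   # K=2, C=1, R=S=2
    print(conv_layer(inp, weights))
```

Every innermost iteration is one multiply-accumulate (MAC), so a layer costs K x out_h x out_w x C x R x S MACs under these assumptions.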
ARCHITECTURE OVERVIEW
Top Level Architecture
[Figure: top-level block diagram with blocks labeled SM (four instances), SDRAM, internal RAM, memory interface, configuration and control block (on the control bus), convolutional buffer (input activations and filter weights), convolution core, and post-processing.]
ARCHITECTURE OVERVIEW
Memory Considerations
Convolutional buffer size vs. memory bandwidth trade-off
If the conv buffer can fit only 1/N of the total weights, the activations need to be read N times
Example: GoogLeNet layer inception 4a/3x3, 16-bit precision
Input activations: 1.2 MB
Filter weights: 360 KB
If the conv buffer is 128 KB, the weights take ceil(360 / 128) = 3 passes, so the minimal bandwidth for activations is 3 x 1.2 MB = 3.6 MB
The convolutional buffer has to be multi-ported; the internal RAM can be single-ported
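The arithmetic in the GoogLeNet example can be written out as a short calculation. This is a sketch of the slide's simplified model only: it assumes the whole conv buffer is available for weights and that activations are re-read once per weight pass. The function name `activation_traffic` is hypothetical.

```python
import math


def activation_traffic(activations_mb, weights_kb, conv_buffer_kb):
    """Return (passes, activation traffic in MB) for the slide's simple model.

    Assumption: the conv buffer holds only weights, so the weights are
    processed in ceil(weights / buffer) passes and the activations are
    read once per pass.
    """
    passes = math.ceil(weights_kb / conv_buffer_kb)
    return passes, passes * activations_mb


if __name__ == "__main__":
    passes, traffic = activation_traffic(1.2, 360, 128)
    print(f"{passes} passes, {traffic:.1f} MB")  # 3 passes, 3.6 MB
```

In this model, a buffer of 360 KB or more would bring the example down to a single pass over the activations.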
CONVOLUTION CORE
Atomic Operation (Each Clock Cycle)
Example: 64 activations x (4 filters x 64 weights) → 4 partial outputs
256 MACs per clock cycle
[Figure: input activations (axes W, H, C) and filter weight data (axes S, C, K).]
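The atomic operation in the example amounts to four 64-element dot products that share one activation vector. The sketch below models only that arithmetic, with the sizes from the slide as constants. The names `ATOMIC_C`, `ATOMIC_K`, and `atomic_op` are hypothetical, and nothing here reflects how the RTL is actually structured.

```python
ATOMIC_C = 64  # activations consumed per cycle (from the slide's example)
ATOMIC_K = 4   # filters processed in parallel (from the slide's example)


def atomic_op(activations, weights):
    """One modeled clock cycle: ATOMIC_K dot products of length ATOMIC_C.

    activations: list of ATOMIC_C values
    weights:     ATOMIC_K lists of ATOMIC_C values, one list per filter
    Returns ATOMIC_K partial outputs.
    """
    assert len(activations) == ATOMIC_C
    assert len(weights) == ATOMIC_K
    assert all(len(f) == ATOMIC_C for f in weights)
    return [sum(a * w for a, w in zip(activations, f)) for f in weights]


if __name__ == "__main__":
    acts = [1] * ATOMIC_C
    wts = [[k] * ATOMIC_C for k in range(ATOMIC_K)]
    print(atomic_op(acts, wts))       # [0, 64, 128, 192]
    print(ATOMIC_C * ATOMIC_K)        # 256 MACs in this one operation
```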
CONVOLUTION CORE
Order of Calculations
Calculate one stripe of outputs at a time
Change activations first, then weights
In the slide show:
C-dimension not shown
Only one kernel shown
[Figure: animation of a kernel (R = S = 3) moving over the input data (W x H) to produce the output data, with "Switch C" steps.]
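One way to read "change activations first, then weights" is as a loop nest in which a weight is held fixed while the activations for a whole stripe of outputs stream past it. The sketch below is an interpretation under stated assumptions, not the documented NVDLA schedule: it uses a single channel and a single kernel (as in the slide show), stride 1, and no padding, and it treats a stripe as a run of consecutive output positions. The names `conv_stripe_order` and `stripe_len` are hypothetical.

```python
def conv_stripe_order(inp, kernel, stripe_len):
    """Single-channel 2D convolution computed one output stripe at a time.

    Within a stripe, the outer loops pick a weight and the inner loop
    sweeps the activations, so each weight stays constant for
    len(stripe) consecutive multiply-accumulates.
    """
    H, W = len(inp), len(inp[0])
    R, S = len(kernel), len(kernel[0])
    out_h, out_w = H - R + 1, W - S + 1
    out = [[0] * out_w for _ in range(out_h)]
    positions = [(y, x) for y in range(out_h) for x in range(out_w)]
    for start in range(0, len(positions), stripe_len):
        stripe = positions[start:start + stripe_len]
        acc = [0] * len(stripe)              # one accumulator per output
        for r in range(R):                   # outer loops: switch weights
            for s in range(S):
                w = kernel[r][s]             # held constant for the sweep
                for i, (y, x) in enumerate(stripe):  # inner: activations
                    acc[i] += inp[y + r][x + s] * w
        for i, (y, x) in enumerate(stripe):
            out[y][x] = acc[i]               # full sums are written once
    return out


if __name__ == "__main__":
    inp = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    print(conv_stripe_order(inp, kernel, stripe_len=3))
```

The result is the same as a direct convolution; only the order of the multiply-accumulates changes.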
CONVOLUTION CORE
Area and Power Considerations
More samples in C-direction per atomic operation:
Allows for a wider Wallace tree (half rather than full adders)
Causes more wasted MACs on non-aligned boundaries
Keeping one operand (weights) of the MACs constant for a number of cycles saves dynamic power and reduces data transfers
Calculate full sums rather than partial sums, as the latter require higher precision before rounding (better to stream/store inputs than outputs)
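The "wasted MACs on non-aligned boundaries" point can be illustrated with a small calculation. This sketch assumes the channel dimension is padded up to a multiple of the atomic C width and that padded lanes do no useful work. The function name `c_utilization` and the example channel counts are illustrative, not from the slides.

```python
import math


def c_utilization(channels, atomic_c):
    """Fraction of MAC lanes along C that do useful work.

    Assumption: C is processed in ceil(channels / atomic_c) steps of
    atomic_c lanes each, and lanes past the last channel are idle.
    """
    steps = math.ceil(channels / atomic_c)
    return channels / (steps * atomic_c)


if __name__ == "__main__":
    for channels, atomic_c in [(64, 64), (96, 64), (96, 32), (3, 64)]:
        print(channels, atomic_c, c_utilization(channels, atomic_c))
```

With 96 channels, a 64-wide atomic operation leaves a quarter of the lanes idle while a 32-wide one leaves none, which is the trade-off against the wider adder tree.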
AREA, PERFORMANCE, AND POWER
Configurations
SMALL CONFIGURATION
INT8 data path
1 RAM interface
No advanced features
LARGE CONFIGURATION
INT8, INT16, FP16 data path
2 RAM interfaces
Integrated controller
Weight compression
…
AREA, PERFORMANCE, POWER
Small Configurations (16 nm, 1 GHz)

| INT8 MACs (# instances) | Conv. Buffer (KB) | Area (mm²) | Memory BW (GB/s) | ResNet50 Perf (frames/s) | Power (mW) | Power Eff. (DL TOPS/W) |
|---|---|---|---|---|---|---|
| 2048 | 512 | 3.3 | 20 | 269 | 388 | 5.4 |
| 1024 | 256 | 1.8 | 15 | 153 | 185 | 6.3 |
| 512 | 256 | 1.4 | 10 | 93 | 107 | 6.8 |
| 256 | 256 | 1.0 | 5 | 46 | 64 | 5.6 |
| 128 | 256 | 0.84 | 2 | 20 | 41 | 3.8 |
| 64 | 128 | 0.55 | 1 | 7.3 | 28 | 2.0 |

Area is synthesis area + internal RAMs, does not account for layout inefficiencies
Power is for DLA incl. internal RAMs, excluding SOC & external RAMs
Calibrated to Xavier silicon - NVIDIA flows, libraries, RAM compilers, …
DL TOPS == #convolutional MAC operations * 2
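The columns for the small configurations can be cross-checked against the DL TOPS definition. Multiplying power efficiency by power gives operations per second, and dividing by frames per second gives the operations per ResNet50 frame that each row implies. The sketch below does only that arithmetic on the published (rounded) values. The function name `implied_ops_per_frame` is hypothetical.

```python
# (INT8 MACs, frames/s, power in mW, DL TOPS/W) from the small-configuration rows
ROWS = [
    (2048, 269, 388, 5.4),
    (1024, 153, 185, 6.3),
    (512, 93, 107, 6.8),
    (256, 46, 64, 5.6),
    (128, 20, 41, 3.8),
    (64, 7.3, 28, 2.0),
]


def implied_ops_per_frame(fps, power_mw, tops_per_watt):
    """Operations per frame implied by one row (1 MAC counts as 2 ops)."""
    ops_per_second = tops_per_watt * 1e12 * (power_mw / 1000.0)
    return ops_per_second / fps


if __name__ == "__main__":
    for macs, fps, power_mw, eff in ROWS:
        gops = implied_ops_per_frame(fps, power_mw, eff) / 1e9
        print(f"{macs:5d} MACs: {gops:.2f} GOPs per frame")
```

All six rows land within a few percent of each other, at roughly 7.6 to 7.8 billion operations per frame, so the three columns are mutually consistent up to rounding.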
AREA, PERFORMANCE, POWER
Large Configuration (16 nm, 1 GHz)

| Data Type | Internal RAM Size | ResNet50 Perf (frames/s) | Power (mW) | Power Eff. (DL TOPS/W) |
|---|---|---|---|---|
| INT8 | none | 165 | 267 | 4.8 |
| FP16 | none | 59 | 276 | 1.6 |
| INT8 | 2M | 230 | 348 | 5.1 |
| FP16 | 2M | 115 | 475 | 1.9 |

Configuration

| Parameter | Value |
|---|---|
| INT16/FP16 | 512 MACs |
| INT8 | 1024 MACs |
| Conv Buffer | 256 KB |
| Area | 2.4 mm² |
| DRAM BW | 15 GB/s |
| TCM R/W BW | 25/25 GB/s |

Area and power do not include Tightly Coupled Memory (TCM)
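The large-configuration numbers also allow a rough estimate of how busy the MAC array is. Achieved throughput is power efficiency times power. Peak throughput is taken here as 2 ops x number of MACs x clock frequency, which assumes every MAC unit can complete one multiply-accumulate per cycle at 1 GHz; that is an assumption of this sketch, not a statement from the slides. The function name `mac_utilization` is hypothetical.

```python
CLOCK_HZ = 1e9                          # 1 GHz, from the slide title
MACS = {"INT8": 1024, "FP16": 512}      # from the configuration list

# (data type, internal RAM, power in mW, DL TOPS/W) from the table rows
ROWS = [
    ("INT8", "none", 267, 4.8),
    ("FP16", "none", 276, 1.6),
    ("INT8", "2M", 348, 5.1),
    ("FP16", "2M", 475, 1.9),
]


def mac_utilization(data_type, power_mw, tops_per_watt):
    """Achieved ops/s divided by an assumed peak of 2 * MACs * clock."""
    achieved = tops_per_watt * 1e12 * (power_mw / 1000.0)
    peak = 2 * MACS[data_type] * CLOCK_HZ
    return achieved / peak


if __name__ == "__main__":
    for data_type, ram, power_mw, eff in ROWS:
        util = mac_utilization(data_type, power_mw, eff)
        print(f"{data_type} / internal RAM {ram}: {util:.0%}")
```

Under these assumptions the rows with internal RAM come out well above the rows without it for both data types, which matches the higher frame rates in the table.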
SW ARCHITECTURE
[Figure: software flow. Compile time: a Caffe model and compiler params pass through the parser and compiler to produce a loadable. Run time: Application, User Mode Driver, Kernel Mode Driver, DLA hardware.]
NVDLA VIRTUAL PLATFORM
SW & RTL prototyping for all configs
Two formats for NVDLA + Memory
C-model
FPGA board
Easy access via cloud
Amazon Web Services (AWS)
[Figure: a QEMU CPU cluster running the OS, application, UMD, and KMD, connected to NVDLA and memory.]
OPEN SOURCE SOC PROTOTYPE
NVDLA + SiFive RISC-V
Demo at SiFive booth
NVDLA config
Small config
2048 MACs
512 KB
YOLOv3 object recognition
[Figure: block diagram with two FPGA blocks, NVDLA, a RISC-V CPU, I/O interfaces, two memory interfaces (Mem IF), and two DRAMs.]
CHECK IT OUT FOR YOURSELF OR CONTRIBUTE!
http://nvdla.org/index.html
Documentation and source code are available under a permissive license
Community contributions under a Contributor License Agreement are encouraged