High-Performance GPU Programming for Deep Learning
7 April 2016 Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™
Proprietary and confidential. Do not distribute.
High-Performance GPU kernels for deep learning
• Fast matrix multiply for small minibatches
• Direct convolution leveraging GEMM advances
• Even faster convolution with Winograd
GEMM: Basics
C = AB
GEMM: Memory Load
[Diagram: thread-to-memory-load mapping for a single tile and for batched GEMM — outer product with contiguous loads vs. outer product with strided loads]
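The outer-product view above can be sketched in NumPy as a model of what one thread block computes (illustrative only, not the actual kernel): each step of the k-loop loads one column of A and one row of B and accumulates their rank-1 outer product into the C tile.

```python
import numpy as np

def gemm_outer_product(A, B):
    """C = A @ B accumulated as rank-1 outer products, mirroring how a
    GPU GEMM tile consumes one column of A and one row of B (one memory
    load each) per step of the k-loop."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for k in range(K):
        C += np.outer(A[:, k], B[k, :])   # one outer-product step
    return C
```

Whether those per-step loads are contiguous or strided depends on the layout of A and B, which is what the diagram above contrasts.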
GEMM: Tile sizes

[Diagram: threads and shared-memory loads for batched GEMM tiles 32×32, GEMM tile 32×64, and GEMM tile 32×32]
hGEMM Results - NN
[Chart: GFLOPS vs. batch size N (32, 64, 96, 128) for an N×3072×3072 NN GEMM — Nervana 32×32 tile vs. cuBLAS 128×64 tile]
hGEMM Results - TN
[Chart: GFLOPS vs. batch size N (32, 64, 96, 128) for an N×3072×3072 TN GEMM — Nervana 32×32 tile vs. cuBLAS 128×64 tile]
Direct convolution is still relevant
• Striding
• Odd-size filters
• Placeholder until a faster algorithm can be implemented
• Often faster for a single image or for the first layer with few input channels (small C)
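One common way direct convolution leverages GEMM machinery is an im2col-style lowering: gather each receptive field into a column, then run a single GEMM. A NumPy sketch under assumed shapes (C input channels, no padding, unit stride); the production kernels slice tiles in place rather than materializing the full column matrix:

```python
import numpy as np

def conv2d_direct(x, w):
    """Direct 2D convolution (cross-correlation, as in deep learning)
    lowered to one GEMM via im2col.
    x: input (C, H, W); w: filters (K, C, R, S); returns (K, P, Q)."""
    C, H, W = x.shape
    K, C2, R, S = w.shape
    assert C == C2
    P, Q = H - R + 1, W - S + 1          # output size: no padding, stride 1
    cols = np.empty((C * R * S, P * Q), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            # gather one receptive field into a column
            cols[:, p * Q + q] = x[:, p:p + R, q:q + S].ravel()
    # single GEMM: (K, C*R*S) @ (C*R*S, P*Q) -> (K, P*Q)
    return (w.reshape(K, -1) @ cols).reshape(K, P, Q)
```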
Direct convolution: implementation details
• Batched GEMM for efficient transpose and higher occupancy
• Compound outer product block remapping
• Square wave pattern for P,Q block mapping
• Slicing: shared memory lookup + integer division
• N vs C contiguous
• Single P,Q vs tiled P,Q
• Bprop as an upside-down fprop
• Update-specific optimizations
Winograd: input transform
[Diagram: input feature map tiled 4×4 with stride 2]

• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
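For F(2×2, 3×3), the input transform of a 4×4 tile d is the nested product V = Bᵀ d B; the 0/±1 entries are why the simplified transform needs only adds and subtracts. A NumPy sketch using the standard F(2×2, 3×3) coefficients:

```python
import numpy as np

# 1D input transform for Winograd F(2x2, 3x3); entries are 0/±1, so
# the simplified transform is all adds and subtracts (zeros removed).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

def input_transform(d):
    """V = BT @ d @ BT.T: the 2D transform as a nested product of 1D
    transforms, applied to each 4x4 tile (tiles taken with stride 2)."""
    return BT @ d @ BT.T
```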
Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
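The filter transform has the same nested form with different coefficients: U = G g Gᵀ maps each 3×3 filter to a 4×4 tile, independently per feature map. A sketch with the standard F(2×2, 3×3) matrix:

```python
import numpy as np

# 1D filter transform for Winograd F(2x2, 3x3): 3x3 filter -> 4x4 tile
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

def filter_transform(g):
    """U = G @ g @ G.T: same nested 1D form as the input transform,
    different coefficients; applied to each filter independently."""
    return G @ g @ G.T
```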
Winograd: batched GEMM
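After both transforms, the convolution reduces to independent GEMMs, one per point of the 4×4 transform, each contracting over input channels. A NumPy model with assumed shapes (K filters, C channels, T tiles), standing in for the actual batched-GEMM kernel:

```python
import numpy as np

def winograd_batched_gemm(U, V):
    """Pointwise stage of Winograd as a batched GEMM.
    U: transformed filters, shape (16, K, C)
    V: transformed input tiles, shape (16, C, T)
    Returns (16, K, T): 16 independent GEMMs, one per transform point,
    each contracting over the C input channels."""
    return np.einsum('xkc,xct->xkt', U, V)
```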
Winograd: output transform
[Diagram: 2×2 tiles of the output feature map]

• Output transform
• Same nested form as the input and filter transforms
• Transform back to pixel space to obtain the 2×2 output tile
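Putting the three stages together for a single channel and a single tile, Y = Aᵀ M A maps the 4×4 pointwise product back to pixel space as a 2×2 output tile. The sketch below uses the standard F(2×2, 3×3) matrices and can be checked against direct 3×3 correlation:

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float64)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

def winograd_2x2_3x3(d, g):
    """One 4x4 input tile d and one 3x3 filter g -> 2x2 output tile."""
    V = BT @ d @ BT.T        # input transform
    U = G @ g @ G.T          # filter transform
    M = U * V                # elementwise product (the GEMM stage,
                             # reduced here to a single channel/tile)
    return AT @ M @ AT.T     # output transform: back to pixel space
```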
Performance: VGG
[Chart: algorithmic speedup (0–2×) vs. batch size (64 down to 1), VGG fp32 totals by operation — Winograd fprop/bprop/update vs. cuDNN fprop/bprop/update]
Performance: Alexnet convolutional layers
[Chart: algorithmic speedup (0–2×) vs. batch size (128 down to 4), Alexnet convolutional-layer totals — Nervana fp16/fp32 vs. cuBLAS fp16/fp32]
Compounding
Compounding inside of GEMM and conv for free:

• alpha / beta
• bias
• relu, prelu, tanh, …
• bprop relu, …
• bprop bias
• batchnorm mean
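The point of compounding: while the output tile is still in registers at the end of the GEMM or conv, these epilogue ops cost no extra memory traffic. A NumPy model of such a fused epilogue (the function name and argument set are illustrative, not neon's API):

```python
import numpy as np

def gemm_compound(A, B, C=None, alpha=1.0, beta=0.0, bias=None, relu=False):
    """GEMM with a compounded epilogue: alpha/beta scaling, bias add,
    and ReLU folded in before the single store, instead of running as
    separate memory-bound passes over the output."""
    out = alpha * (A @ B)
    if C is not None and beta != 0.0:
        out += beta * C              # accumulate into prior output
    if bias is not None:
        out += bias                  # broadcast bias across rows
    if relu:
        np.maximum(out, 0.0, out=out)
    return out
```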
Summary
• Nervana has the fastest tools for deep learning
• neon with state-of-the-art Maxwell kernels
• Nervana Cloud with multi-GPU training
• Watch for Nervana Engine, our deep learning processor