![Page 1: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/1.jpg)
Persistent RNNs (stashing recurrent weights on-chip)
Gregory Diamos
Baidu SVAIL
April 7, 2016
Gregory Diamos Persistent RNNs
![Page 2: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/2.jpg)
SVAIL
Think hard AI.
Goal
Develop hard AI technologies that impact 100 million users.
![Page 3: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/3.jpg)
Deep Learning at SVAIL
[Figure: recognition accuracy (y-axis) versus data and compute (x-axis). Curves: many previous methods; deep learning. Reference levels: state of the art; human level. Compute scale: 100 GFLOP/s (1 laptop), 6 TFLOP/s (1 GPU), 800 TFLOP/s (128 GPUs), 100 PFLOP/s (16K GPUs).]
Hypothesis: deep learning scales with data and compute.
Can we strong scale deep learning to the limits of technology?
![Page 4: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/4.jpg)
Persistent RNNs
30x speedup at a mini-batch size of 4
Why is reducing the mini-batch size important?
Train bigger and deeper models.
Strong scale to more GPUs.
Improve efficiency of deployed models.
![Page 5: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/5.jpg)
Training Deep RNNs
![Page 6: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/6.jpg)
Deep Speech
Near human level speech recognition in Mandarin and English
Trained on over 10,000 hours (about 1 year) of speech data.
20 ExaFLOPs of work to train (7 days on 16 GPUs at 40% of peak).
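The training-work figure can be sanity-checked with a little arithmetic. The sketch below assumes a per-GPU peak of 6.144 TFLOP/s, a number that appears elsewhere in this deck; the slide itself does not say which GPU the 40% refers to, so treat the peak as an assumption. The function name is illustrative.

```python
# Assumed per-GPU peak throughput (taken from the memory-hierarchy slide of
# this deck; the Deep Speech slide does not state it).
ASSUMED_PEAK_FLOPS_PER_GPU = 6.144e12


def training_work_flops(days, gpus, fraction_of_peak,
                        peak_flops_per_gpu=ASSUMED_PEAK_FLOPS_PER_GPU):
    """Total floating-point operations done over a training run."""
    seconds = days * 24 * 3600
    return seconds * gpus * peak_flops_per_gpu * fraction_of_peak


# 7 days on 16 GPUs at 40% of peak, as quoted on the slide.
work = training_work_flops(days=7, gpus=16, fraction_of_peak=0.40)
print(f"{work / 1e18:.1f} exaFLOPs")
```

Under that assumption the result lands in the low twenties of exaFLOPs, the same order of magnitude as the 20 quoted on the slide.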
![Page 7: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/7.jpg)
Data parallel training
[Figure: a mini-batch of speech data is divided among GPU 0, GPU 1, ...]
Data parallelism:
The training data is grouped into mini-batches.
Each GPU trains a copy of the model on a slice of the mini-batch.
GPUs synchronize their models after a fixed number of steps.
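A minimal sketch of the data-parallel idea, assuming a toy one-parameter model `y = w * x` with squared-error loss and synchronization after every step (a fixed interval of 1). The model, the function names, and the choice to synchronize by averaging gradients are illustrative assumptions, not the Deep Speech implementation.

```python
def split_minibatch(samples, num_gpus):
    """Split a mini-batch into equal contiguous slices, one per GPU."""
    if len(samples) % num_gpus != 0:
        raise ValueError("mini-batch size must be divisible by the GPU count")
    per_gpu = len(samples) // num_gpus
    return [samples[i * per_gpu:(i + 1) * per_gpu] for i in range(num_gpus)]


def slice_gradient(w, batch_slice):
    """Mean gradient of (w * x - y)**2 with respect to w over one slice."""
    return sum(2.0 * (w * x - y) * x for x, y in batch_slice) / len(batch_slice)


def data_parallel_step(w, samples, num_gpus, learning_rate):
    """One SGD step: each 'GPU' handles a slice, then gradients are averaged."""
    slices = split_minibatch(samples, num_gpus)
    grads = [slice_gradient(w, s) for s in slices]  # one gradient per GPU
    avg_grad = sum(grads) / len(grads)              # the synchronization point
    return w - learning_rate * avg_grad


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
print(data_parallel_step(0.0, batch, num_gpus=2, learning_rate=0.01))
print(data_parallel_step(0.0, batch, num_gpus=1, learning_rate=0.01))
```

With equal-sized slices, the averaged gradient matches the single-GPU gradient over the whole mini-batch, so both calls take the same step up to floating-point rounding.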
![Page 8: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/8.jpg)
Mini-batch constraints
So how should you choose the mini-batch size?
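[Figure: wall-clock time to convergence (y-axis) versus mini-batch size (x-axis). Small mini-batches fall in the "inefficient hardware" region and large ones in the "inefficient optimization" region; the axis is marked at 64 per GPU and at 1024.]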
Hardware efficiency will set a lower bound.
Optimization efficiency will set an upper bound.
Shrinking the mini-batch per GPU enables the use of more GPUs.
![Page 9: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/9.jpg)
Determining the batch size
The upper bound can be found empirically.
In general a hyperparameter search is needed, but a useful heuristic is:
$$\text{momentum} = 1.0 - \frac{\text{miniBatchSize}}{\text{windowSize}}$$
$$\text{learningRate} = \text{stepSize} \times (1.0 - \text{momentum}) \times \text{miniBatchSize}$$
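The heuristic translates directly into code. The slide does not define `windowSize` or `stepSize`, so this sketch treats them as plain hyperparameters; the function names and the example values (a window of 400, a step size of 0.1) are made up for illustration.

```python
def momentum_heuristic(mini_batch_size, window_size):
    """momentum = 1.0 - miniBatchSize / windowSize"""
    if not 0 < mini_batch_size <= window_size:
        raise ValueError("expected 0 < mini_batch_size <= window_size")
    return 1.0 - mini_batch_size / window_size


def learning_rate_heuristic(step_size, momentum, mini_batch_size):
    """learningRate = stepSize * (1.0 - momentum) * miniBatchSize"""
    return step_size * (1.0 - momentum) * mini_batch_size


# Illustrative values only; they are not taken from the talk.
for batch in (4, 64):
    m = momentum_heuristic(batch, window_size=400)
    lr = learning_rate_heuristic(0.1, m, batch)
    print(batch, round(m, 4), round(lr, 6))
```

Substituting the first formula into the second gives `stepSize * miniBatchSize**2 / windowSize`, so under this heuristic both the momentum and the learning rate shrink as the mini-batch gets smaller.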
![Page 10: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/10.jpg)
Persistent RNN Details
![Page 11: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/11.jpg)
RNN primer
RNNs built on GEMM calls reload the weights (U) each timestep.
However, the weights are constant, and this is wasteful.
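To make the waste concrete, here is a small sketch of a recurrent loop that counts how many weights are read. It assumes a simple recurrence of the form h_t = relu(U h_{t-1} + x_t); the slide names only the weights U, so the activation function and the way inputs enter are assumptions, as are the function names.

```python
def matvec(U, h):
    """Dense matrix-vector product; touches every weight in U once."""
    return [sum(u * x for u, x in zip(row, h)) for row in U]


def rnn_forward(U, inputs, h0):
    """Run the recurrence and count the weights read along the way."""
    h = list(h0)
    weight_reads = 0
    for x in inputs:                      # one iteration per timestep
        pre = matvec(U, h)                # the same U is read again here
        weight_reads += len(U) * len(U[0])
        h = [max(0.0, p + xi) for p, xi in zip(pre, x)]
    return h, weight_reads


U = [[1.0, 0.0], [0.0, 1.0]]
h, reads = rnn_forward(U, inputs=[[1.0, 0.0], [0.0, 1.0]], h0=[0.0, 0.0])
print(h, reads)
```

U never changes inside the loop, yet the read count grows linearly with the number of timesteps. Keeping U resident on-chip would reduce that to a single load.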
![Page 12: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/12.jpg)
Caching weights in registers
[Figure: GPU memory hierarchy, three levels.
GPU (x 1): 380 GB/s, 300 ns, 5.5 MB, 6.144 TFLOP/s.
Core (x 24): 128 GB/s, 30 ns, 230 KB, 256 GFLOP/s.
Thread (x 128): 16 GB/s, 6 ns, 896 B, 2 GFLOP/s.]
Off-chip memory is much slower and less efficient than registers.
GPUs have more on-chip memory in registers than anywhere else.
Cache RNN weights in registers and reuse them over timesteps.
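A quick capacity check shows why the layer size matters. The sketch assumes 4-byte (fp32) weights, reads the slide's 5.5 MB as 5.5e6 bytes of GPU-wide register capacity, and ignores registers needed for anything other than weights. The 1152-unit layer size comes from later slides in this deck; the function names are illustrative.

```python
ASSUMED_REGISTER_BYTES = 5.5e6   # "5.5 MB" from the slide, read as decimal MB
ASSUMED_BYTES_PER_WEIGHT = 4     # fp32 is an assumption


def recurrent_weight_bytes(hidden_units, bytes_per_weight=ASSUMED_BYTES_PER_WEIGHT):
    """Size of a square hidden-to-hidden weight matrix."""
    return hidden_units * hidden_units * bytes_per_weight


def fits_in_registers(hidden_units, capacity_bytes=ASSUMED_REGISTER_BYTES):
    """True if the whole recurrent matrix fits in the register budget."""
    return recurrent_weight_bytes(hidden_units) <= capacity_bytes


for n in (1152, 1280):
    print(n, recurrent_weight_bytes(n), fits_in_registers(n))
```

Under these assumptions a 1152-unit layer needs 5,308,416 bytes and just fits, while a 1280-unit layer does not. Halving the bytes per weight would roughly double the number of weights that fit.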
![Page 13: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/13.jpg)
Choosing the tile sizes
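[Figure: tiling of the 1152 x 1152 recurrent weight matrix. The matrix is split into block rows across SM0 to SM23 (1152 x 48 each); each SM's block is split across Warp0 to Warp7 (1152 x 6 each); each warp's block is split across the 32 threads of the warp, interleaved as Thread0, Thread16, Thread1, Thread17, ..., with tile dimensions labeled 3 and 2.]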
Block rows avoid additional inter-CTA synchronizations.
Each SM loads the activations into shared memory.
Threads are interleaved to avoid shared memory bank conflicts.
Vector loads and broadcasts amplify shared memory bandwidth.
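The tile sizes can be reproduced with integer arithmetic. This sketch assumes one reading of the figure: 24 SMs, 8 warps per SM, 32 threads per warp, and 4-byte weights. The exact per-thread tile layout is not recoverable from the slide, so only the totals are computed, and the function name is illustrative.

```python
def tile_sizes(hidden_units=1152, sms=24, warps_per_sm=8,
               threads_per_warp=32, bytes_per_weight=4):
    """Split a square weight matrix by block rows, then across a warp."""
    rows_per_sm = hidden_units // sms                 # block row per SM
    rows_per_warp = rows_per_sm // warps_per_sm       # block row per warp
    weights_per_thread = rows_per_warp * hidden_units // threads_per_warp
    return {
        "rows_per_sm": rows_per_sm,
        "rows_per_warp": rows_per_warp,
        "weights_per_thread": weights_per_thread,
        "bytes_per_thread": weights_per_thread * bytes_per_weight,
    }


print(tile_sizes())
```

With these assumptions each SM owns 48 rows, each warp 6 rows, and each thread 216 weights (864 bytes), which sits just under the 896 B per-thread figure shown on the previous slide.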
![Page 14: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/14.jpg)
Global barriers on GPUs
[Figure: two launch models, each showing a grid of cooperative thread arrays (CTAs) and the inside of one CTA. Kernel launch: barriers and divergent branches exist only within a CTA. Persistent kernel launch: barriers and divergent branches within a CTA, plus a global barrier across CTAs.]
An inter-CTA barrier is implemented with a counting semaphore.
Uses atomic, membar, and cache modified load/store operations.
Completes in about 500ns on a TitanX GPU.
Disclaimer: global barriers violate the CUDA 7.5 model.
CUDA does not guarantee forward progress of multiple CTAs.
Our system implements cooperative threading for correctness.
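The counting idea behind the barrier can be illustrated on a CPU. The sketch below is an analogy only: Python threads stand in for CTAs, and a condition variable stands in for the atomics, memory fences, and cache-modified loads and stores a GPU version would use. The class and function names are hypothetical, and this is not the talk's CUDA implementation.

```python
import threading


class CountingBarrier:
    """Reusable barrier built from a shared arrival counter."""

    def __init__(self, parties):
        self.parties = parties
        self.count = 0          # arrivals in the current round
        self.generation = 0     # incremented each time the barrier opens
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            my_generation = self.generation
            self.count += 1
            if self.count == self.parties:
                # Last arrival resets the counter and releases the others.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                while my_generation == self.generation:
                    self.cond.wait()


def run_timesteps(workers=4, timesteps=3):
    """Each worker logs its timestep, then waits for all the others."""
    barrier = CountingBarrier(workers)
    log, log_lock = [], threading.Lock()

    def worker():
        for t in range(timesteps):
            with log_lock:
                log.append(t)
            barrier.wait()      # nobody starts t + 1 until all finished t

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log


print(run_timesteps())
```

Because every worker must arrive before any is released, all entries for one timestep appear in the log before any entry for the next. The sketch also shows why forward progress matters: if one participant were never scheduled, the rest would wait forever.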
![Page 15: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/15.jpg)
Software pipelining
[Figure: software pipeline schedule. Each of the four mini-batch entries (mini-batch 0 to mini-batch 3) repeats the sequence load, math, reduce, barrier, with each entry offset by one stage from the previous one. The schedule runs from timestep 0 to timestep n-1, with iterations labeled i0 through i4n+2.]
Software pipelining is used to hide latency.
Thread local math (430ns).
Intra-SM reduction (320ns).
Global loads (315ns).
Global barrier (500ns).
These are grouped into 4 pipeline stages, which are kept full with a mini-batch of 4.
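The staggered schedule can be sketched as a small table generator. The stage order (load, math, reduce, barrier) and the latencies come from the slide; the one-stage offset between mini-batch entries is read from the figure, and the function names are illustrative.

```python
STAGES = ["load", "math", "reduce", "barrier"]
STAGE_NS = {"load": 315, "math": 430, "reduce": 320, "barrier": 500}


def schedule(num_slots, num_items=4):
    """Stage executed by each mini-batch entry in each pipeline slot."""
    table = []
    for slot in range(num_slots):
        row = []
        for item in range(num_items):
            if slot < item:
                row.append(None)          # pipeline is still filling
            else:
                row.append(STAGES[(slot - item) % len(STAGES)])
        table.append(row)
    return table


def serial_ns_per_timestep():
    """Latency of one timestep if the four stages run back to back."""
    return sum(STAGE_NS.values())         # 1565 ns


for slot, row in enumerate(schedule(6)):
    print(slot, row)
print(serial_ns_per_timestep())
```

From slot 3 onward every stage is occupied by a different mini-batch entry, which is the sense in which a mini-batch of 4 keeps the pipeline full. If the stages overlapped perfectly, the slowest stage (the 500 ns barrier) would bound the time per slot; that is an idealization, not a measured result.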
![Page 16: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/16.jpg)
Strong Scaling
![Page 17: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/17.jpg)
Scaling to 128 GPUs
Scaling results for end-to-end model training.
8 GPUs per node, with 7 GB/s InfiniBand between nodes.
The algorithmic mini-batch size is fixed at 512.
[Figure: Deep Speech Scaling With 1152 Unit Layers. x-axis: GPU count (0 to 140); y-axis: TeraFLOP/s (0 to 300). Series: PERSISTENT-RNN, GEMM-RNN, PERFECT SCALING.]
A smaller mini-batch per GPU enables the use of up to 128 GPUs.
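The link between per-GPU mini-batch and GPU count is simple division. The sketch uses the fixed algorithmic mini-batch of 512 from this slide and compares a per-GPU mini-batch of 4 with 64; the function name is illustrative.

```python
def max_gpus(algorithmic_batch, per_gpu_batch):
    """Largest GPU count that keeps the algorithmic mini-batch fixed."""
    if algorithmic_batch % per_gpu_batch != 0:
        raise ValueError("algorithmic batch must be a multiple of the per-GPU batch")
    return algorithmic_batch // per_gpu_batch


for per_gpu in (64, 4):
    print(per_gpu, max_gpus(512, per_gpu))
```

At 64 samples per GPU the fixed mini-batch of 512 can only be spread over 8 GPUs, whereas at 4 samples per GPU it can be spread over 128.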
![Page 18: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/18.jpg)
Exploring deep residual RNNs
Using a mini-batch of 4 per GPU provides a 16x reduction in memory.
Models with more parameters can now fit into GPU memory.
[Figure: Deep Residual Network Error Rate Reduction With Depth. x-axis: recurrent layer count (0 to 90); y-axis: word error rate, English (27 to 36). Series: Deep Residual RNN.]
Results suggest that residual skip connections also apply to RNNs.
![Page 19: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/19.jpg)
Pascal and future
Future GPUs will enable bigger and faster RNN layers.
bigger GPUs (more threads, more registers)
low-latency atomics between GPUs (NVLink)
lower precision (fp16)
![Page 20: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/20.jpg)
Conclusions
So far, deep learning for speech recognition has scaled with compute.
[Figure: recognition accuracy (y-axis) versus data and compute (x-axis). Curves: many previous methods; deep learning. Reference levels: state of the art; human level. Compute scale: 100 GFLOP/s (1 laptop), 6 TFLOP/s (1 GPU), 800 TFLOP/s (128 GPUs), 100 PFLOP/s (16K GPUs).]
Persistent kernels provide a new tool for accelerating RNN training.
Let’s continue building faster computers, software, and algorithms.
What other hard AI problems will scale with deep learning and compute?
![Page 21: Persistent RNNs - (stashing recurrent weights on-chip)on-demand.gputechconf.com/...diamos-persisten-rnns.pdf · Persistent RNNs (stashing recurrent weights on-chip) Gregory Diamos](https://reader030.vdocuments.us/reader030/viewer/2022041110/5f0e98337e708231d43ffde0/html5/thumbnails/21.jpg)
Questions
Questions?
Contact Me:
Gregory Diamos - [email protected]
Baidu USA is hiring!
http://usa.baidu.com/careers/