TRANSCRIPT
April 4-7, 2016 | Silicon Valley
Minmin Sun, NVIDIA
April 5th
HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU
AGENDA
Brief Introduction of CTC
Alpha/Beta Matrix Computation
Gradient Matrix Computation
Overall Performance
BRIEF INTRODUCTION OF CTC
BRIEF INTRODUCTION OF CTC Overview
CTC is a loss function used to train the RNN
Inputs: (1) 𝑝, the softmax output; (2) the label sequence
Output: 𝑔, the gradient w.r.t. the output layer
CTC includes: (1) Alpha computation, (2) Beta computation, (3) Gradient computation
[Figure: an RNN unrolled over frames t−1, t, t+1. At each frame t the hidden layer h[t] feeds the output layer y[t], whose softmax gives 𝑝[𝑡]; CTC consumes 𝑝[𝑡] together with the label sequence 'C', 'A', 'T' and returns the gradient 𝑔[𝑡] to the output layer.]
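For context (standard CTC background, per Graves et al. 2006, not stated on this slide): the training loss is the negative log-likelihood of the label sequence, summing over every frame-level alignment 𝜋 that collapses to it:

$$\mathit{nll} = -\ln p(l \mid x), \qquad p(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} \prod_{t=1}^{T} p_t(\pi_t)$$

where 𝓑 is the map that removes repeated characters and blanks. The Alpha/Beta (forward/backward) matrices below evaluate this sum and its gradient in O(𝑇·𝑆) time instead of enumerating alignments.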
BRIEF INTRODUCTION OF CTC Alpha/Beta Matrix Computation
Matrix dimensions: 𝑇 rows × 𝑆 columns
𝑆 = 2𝐿 + 1 is the length of the augmented label sequence 𝑙
𝐿 is the number of characters in the original label sequence
𝑇 is the number of time-steps in the utterance
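A quick worked example, matching the 'C', 'A', 'T' label sequence from the figure above:

$$l = (\mathrm{blank}, \mathrm{C}, \mathrm{blank}, \mathrm{A}, \mathrm{blank}, \mathrm{T}, \mathrm{blank}), \qquad L = 3, \quad S = 2 \cdot 3 + 1 = 7$$

so with 𝑇 = 150 frames the Alpha matrix has 150 × 7 entries.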
$$\alpha_t(s) = \begin{cases} \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\bigr)\cdot p_t(l_s) & \text{if } l_s = \mathrm{blank} \text{ or } l_s = l_{s-2} \\ \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\cdot p_t(l_s) & \text{otherwise} \end{cases}$$
BRIEF INTRODUCTION OF CTC Alpha/Beta Matrix Computation
[Figure: the Alpha matrix for the augmented label sequence 𝑙 = blank, c, blank, a, blank, t, blank. The entry 𝛼_t(s) in row 𝑡 depends only on 𝛼_{t−1}(s), 𝛼_{t−1}(s−1), and 𝛼_{t−1}(s−2) in row 𝑡−1.]
BRIEF INTRODUCTION OF CTC Gradient Matrix Computation
Matrix dimensions: 𝑇 rows × 𝐴 columns
𝐴 is the alphabet size, e.g. 28 for English
Key-value reduction using the character 𝑙(𝑠) as the key
$$g_t(a) \;=\; p_t(a) \;-\; \frac{1}{p_t(a)\cdot \mathit{nll}} \sum_{s:\, l_s = a} \alpha_t(s)\,\beta_t(s)$$
BRIEF INTRODUCTION OF CTC Gradient Matrix Computation
[Figure: row 𝑡 of the element-wise product 𝜶 ∗ 𝜷 over the augmented sequence blank, C, blank, A, blank, T, blank is reduced into row 𝑡 of the gradient matrix 𝒈, whose columns are blank, A, B, C, …, Z, space; entries sharing the same character are summed.]
ALPHA/BETA MATRIX COMPUTATION
ALPHA/BETA MATRIX COMPUTATION GPU Implementation
Each CUDA block owns one sequence, i.e. the number of blocks is the minibatch size
Each thread owns one column of the Alpha/Beta matrix
Threads iterate over the matrix rows, synchronizing after each iteration (see the kernel sketch below)
[Figure: within one block, thread 𝑠 computes 𝛼_t(s) from 𝛼_{t−1}(s) in its own column and from 𝛼_{t−1}(s−1), 𝛼_{t−1}(s−2) held by threads 𝑠−1 and 𝑠−2.]
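A minimal sketch of this mapping, under stated assumptions: the kernel and argument names are invented, probabilities are kept in linear space (warp-ctc actually works in log-space), and labels are assumed padded to 𝑆 entries per sequence. It illustrates the thread layout, not the speaker's kernel:

```cuda
// Minimal sketch of the mapping above -- NOT the speaker's production
// kernel. Names, argument layout, linear-space math, and boundary
// handling are simplifying assumptions.
// Launch: alpha_kernel<<<minibatch, S>>>  (one block per sequence,
// one thread per column of the Alpha matrix).
__global__ void alpha_kernel(const float* p,     // minibatch x T x A softmax
                             const int*   label, // minibatch x S augmented labels
                             float*       alpha, // minibatch x T x S output
                             int T, int S, int A, int blank)
{
    int s = threadIdx.x;                   // this thread owns column s
    p     += (size_t)blockIdx.x * T * A;   // jump to this block's sequence
    label += (size_t)blockIdx.x * S;
    alpha += (size_t)blockIdx.x * T * S;

    int ls  = label[s];
    int ls2 = (s >= 2) ? label[s - 2] : -1;

    // t = 0: only the leading blank and the first label can start a path.
    alpha[s] = (s < 2) ? p[ls] : 0.0f;
    __syncthreads();

    for (int t = 1; t < T; ++t) {          // iterate over rows, sync each step
        float a  = alpha[(t - 1) * S + s];
        float a1 = (s >= 1) ? alpha[(t - 1) * S + s - 1] : 0.0f;
        float a2 = (s >= 2) ? alpha[(t - 1) * S + s - 2] : 0.0f;
        float sum = (ls == blank || ls == ls2) ? (a + a1) : (a + a1 + a2);
        alpha[t * S + s] = sum * p[t * A + ls];
        __syncthreads();                   // row t complete before row t+1
    }
}
```

The Beta matrix uses the mirror-image kernel, iterating from t = T−1 down to 0.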
ALPHA/BETA MATRIX COMPUTATION Data Reuse
𝒍(𝒔) and 𝒍(𝒔 − 𝟐) are used in every iteration
They are invariant across all iterations
So load them into the register file once, to be reused by all iterations of the thread
$$\alpha_t(s) = \begin{cases} \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\bigr)\cdot p_t(\boldsymbol{l_s}) & \text{if } \boldsymbol{l_s} = \mathrm{blank} \text{ or } \boldsymbol{l_s} = \boldsymbol{l_{s-2}} \\ \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\cdot p_t(\boldsymbol{l_s}) & \text{otherwise} \end{cases}$$
ALPHA/BETA MATRIX COMPUTATION Data Reuse
𝜶_{𝒕−𝟏}(𝒔) is the output of the previous iteration of the same thread
Thus it can be carried across iterations in the register file
$$\alpha_t(s) = \begin{cases} \bigl(\boldsymbol{\alpha_{t-1}(s)} + \alpha_{t-1}(s-1)\bigr)\cdot p_t(l_s) & \text{if } l_s = \mathrm{blank} \text{ or } l_s = l_{s-2} \\ \bigl(\boldsymbol{\alpha_{t-1}(s)} + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\cdot p_t(l_s) & \text{otherwise} \end{cases}$$
ALPHA/BETA MATRIX COMPUTATION Data Reuse
𝜶_{𝒕−𝟏}(𝒔−𝟏) and 𝜶_{𝒕−𝟏}(𝒔−𝟐) are outputs of the previous iteration of the other threads in the same block
Thus they can be transferred through shared memory (all three reuse points are combined in the sketch below)
$$\alpha_t(s) = \begin{cases} \bigl(\alpha_{t-1}(s) + \boldsymbol{\alpha_{t-1}(s-1)}\bigr)\cdot p_t(l_s) & \text{if } l_s = \mathrm{blank} \text{ or } l_s = l_{s-2} \\ \bigl(\alpha_{t-1}(s) + \boldsymbol{\alpha_{t-1}(s-1)} + \boldsymbol{\alpha_{t-1}(s-2)}\bigr)\cdot p_t(l_s) & \text{otherwise} \end{cases}$$
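Putting the three reuse points together, the row loop of the alpha_kernel sketch above can stay almost entirely on-chip. The shared-memory row a_prev and the register variable mine are illustrative names, not from the talk:

```cuda
// Row loop of alpha_kernel, rewritten with the three data-reuse points
// above (illustrative sketch). ls/ls2 already sit in registers; a
// thread's own previous alpha lives in the register 'mine'; neighbor
// alphas are exchanged through the shared-memory row 'a_prev'.
extern __shared__ float a_prev[];          // row t-1 of Alpha, S floats

float mine = (s < 2) ? p[ls] : 0.0f;       // alpha[0][s], kept in a register
a_prev[s]  = mine;
alpha[s]   = mine;                         // row 0 still goes to global memory
__syncthreads();

for (int t = 1; t < T; ++t) {
    float a1 = (s >= 1) ? a_prev[s - 1] : 0.0f;   // neighbors via shared memory
    float a2 = (s >= 2) ? a_prev[s - 2] : 0.0f;
    float sum = (ls == blank || ls == ls2) ? (mine + a1) : (mine + a1 + a2);
    mine = sum * p[t * A + ls];            // own alpha never leaves the register
    __syncthreads();                       // all reads of row t-1 are done
    a_prev[s] = mine;                      // publish row t to the block
    alpha[t * S + s] = mine;               // and to global memory (for beta/grad)
    __syncthreads();                       // row t visible before the next step
}
```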
ALPHA/BETA MATRIX COMPUTATION Performance on Titan X – Small Alphabet Size

T=150, L=40, A=28:

N (batch) | warp-ctc | optimized | speedup
N=1       | 0.41ms   | 0.22ms    | 1.89x
N=16      | 0.42ms   | 0.23ms    | 1.84x
N=32      | 0.42ms   | 0.23ms    | 1.82x
N=64      | 0.43ms   | 0.26ms    | 1.70x
N=128     | 0.47ms   | 0.30ms    | 1.56x

warp-ctc: https://github.com/baidu-research/warp-ctc
ALPHA/BETA MATRIX COMPUTATION Performance on Titan X – Large Alphabet Size

T=150, L=20, A=5000:

N (batch) | warp-ctc | optimized | speedup
N=1       | 0.41ms   | 0.25ms    | 1.65x
N=16      | 0.47ms   | 0.28ms    | 1.66x
N=32      | 0.47ms   | 0.28ms    | 1.65x
N=64      | 0.48ms   | 0.29ms    | 1.65x
N=128     | 0.50ms   | 0.30ms    | 1.68x

warp-ctc: https://github.com/baidu-research/warp-ctc
GRADIENT MATRIX COMPUTATION
GRADIENT MATRIX COMPUTATION GPU Implementation
Each block owns one row of the Alpha and Beta matrices, i.e. the number of blocks is minibatch × 𝑇
Within each block, key-value reduction through atomic operations on shared memory (see the sketch below)
[Figure: block 𝑡 reduces row 𝑡 of 𝜶 ∗ 𝜷 into row 𝑡 of 𝒈 (columns blank, A, B, C, …, Z, space) by atomic adds into the shared memory of block 𝑡.]
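A sketch of this reduction, with the same caveats as before (invented names, linear-space math; nll is passed in per sequence, as in the gradient formula above):

```cuda
// Sketch of the per-row key-value reduction -- NOT the speaker's code;
// kernel and parameter names are assumptions. Grid size = minibatch * T,
// one block per row t of one sequence; dynamic shared memory holds the
// A per-character accumulators.
__global__ void grad_kernel(const float* p,     // (minibatch*T) x A softmax rows
                            const float* ab,    // (minibatch*T) x S alpha*beta rows
                            const int*   label, // minibatch x S augmented labels
                            const float* nll,   // per-sequence nll term (see formula)
                            float*       g,     // (minibatch*T) x A gradient rows
                            int T, int S, int A)
{
    extern __shared__ float acc[];              // one accumulator per character
    int row = blockIdx.x;                       // which (sequence, t) pair
    int seq = row / T;
    const float* p_row  = p  + (size_t)row * A;
    const float* ab_row = ab + (size_t)row * S;
    const int*   l      = label + (size_t)seq * S;
    float*       g_row  = g  + (size_t)row * A;

    for (int a = threadIdx.x; a < A; a += blockDim.x)
        acc[a] = 0.0f;
    __syncthreads();

    // Key-value reduction: key = character l[s], value = alpha*beta.
    for (int s = threadIdx.x; s < S; s += blockDim.x)
        atomicAdd(&acc[l[s]], ab_row[s]);
    __syncthreads();

    // Combine with the softmax row to produce the gradient row.
    for (int a = threadIdx.x; a < A; a += blockDim.x)
        g_row[a] = p_row[a] - acc[a] / (p_row[a] * nll[seq]);
}
```

Launched as grad_kernel<<<minibatch * T, threads, A * sizeof(float)>>>; every atomicAdd lands in the shared memory of its own block.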
GRADIENT MATRIX COMPUTATION Compute for Blanks Separately
Blanks contribute most of the address conflicts
We know their exact positions in the augmented label sequence: every even index is a blank
Computed separately, the blank sum becomes an ordinary parallel reduction problem (see the sketch below)
[Figure: the blank positions of 𝒍 = blank, C, blank, A, blank, T, blank are summed by an ordinary parallel reduction in shared memory, while the remaining characters keep the atomic key-value reduction in shared memory.]
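Because the augmented sequence places a blank at every even index, the blank sum can be a textbook shared-memory tree reduction. A sketch (this helper and its names are assumptions, not the speaker's code):

```cuda
// Illustrative sketch of the blank-only sum as an ordinary parallel
// reduction. Blanks sit at the even indices 0, 2, ..., S-1 of the
// augmented sequence, so each thread sums a strided slice and a tree
// reduction in shared memory combines the partial sums.
__device__ float reduce_blanks(const float* ab_row, int S, float* scratch)
{
    float sum = 0.0f;
    for (int s = 2 * threadIdx.x; s < S; s += 2 * blockDim.x)
        sum += ab_row[s];                    // every even s is a blank
    scratch[threadIdx.x] = sum;
    __syncthreads();

    // Standard tree reduction (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();
    }
    return scratch[0];                       // total alpha*beta mass of blanks
}
```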
GRADIENT MATRIX COMPUTATION Allocate Redundant Shared Memory
Allocating several redundant copies of the accumulators reduces address conflicts for the atomic operations
The partial results in the redundant shared-memory elements are then accumulated for each character in parallel (see the sketch below)
Not applicable for languages with a large alphabet size, like Chinese, where the redundant copies would no longer fit in shared memory
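A sketch of the redundant-accumulator variant of the grad_kernel sketch above; the redundancy factor R and all names are assumptions:

```cuda
// Variant of the grad_kernel sketch with R-way redundant accumulators
// (R and all names are illustrative assumptions). Shared memory now
// holds R * A floats; launch with that dynamic shared-memory size.
#define R 4                                     // redundancy factor (assumed)

__global__ void grad_kernel_redundant(const float* p, const float* ab,
                                      const int* label, const float* nll,
                                      float* g, int T, int S, int A)
{
    extern __shared__ float acc[];              // R copies of A accumulators
    int row = blockIdx.x, seq = row / T;
    const float* p_row  = p  + (size_t)row * A;
    const float* ab_row = ab + (size_t)row * S;
    const int*   l      = label + (size_t)seq * S;
    float*       g_row  = g  + (size_t)row * A;

    for (int i = threadIdx.x; i < R * A; i += blockDim.x)
        acc[i] = 0.0f;
    __syncthreads();

    // Threads pick a copy by threadIdx.x % R, so atomics on the same
    // character mostly hit different shared-memory addresses.
    for (int s = threadIdx.x; s < S; s += blockDim.x)
        atomicAdd(&acc[(threadIdx.x % R) * A + l[s]], ab_row[s]);
    __syncthreads();

    // Fold the R partial copies back into one sum per character.
    for (int a = threadIdx.x; a < A; a += blockDim.x) {
        float sum = 0.0f;
        for (int r = 0; r < R; ++r)
            sum += acc[r * A + a];
        g_row[a] = p_row[a] - sum / (p_row[a] * nll[seq]);
    }
}
```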
GRADIENT MATRIX COMPUTATION Reuse the memory of Matrix 𝑝 for Gradient Matrix 𝑔
The summation term is 0 for more than 99% of the characters in the large Chinese alphabet
So more than 99% of the elements of Matrix 𝑔 equal the corresponding elements of Matrix 𝑝, and nearly half the time is spent "copying" them from Matrix 𝑝 to Matrix 𝑔
Matrix 𝑝 is no longer needed after the gradient computation
By reusing the memory of Matrix 𝑝 for Gradient Matrix 𝑔, we only need to update the gradients of less than 1% of the matrix elements (see the sketch after the formula below)
Not necessary for languages with a small alphabet size, like English
$$g_t(a) \;=\; p_t(a) \;-\; \frac{1}{p_t(a)\cdot \mathit{nll}} \sum_{s:\, l_s = a} \alpha_t(s)\,\beta_t(s)$$
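A sketch of the in-place variant, assuming the gradient matrix aliases the softmax matrix (a single buffer pg) and blockDim.x ≥ S so each augmented-label position has its own thread; all names are illustrative, not the speaker's code:

```cuda
// Sketch of the in-place update -- illustrative assumptions throughout.
// g aliases p (same buffer pg), so >99% of each row is already correct;
// only the at most S characters appearing in the label are rewritten.
// Assumes blockDim.x >= S; dynamic shared memory holds A floats.
__global__ void grad_inplace_kernel(float* pg,       // aliased p/g rows
                                    const float* ab, // alpha*beta rows
                                    const int* label, const float* nll,
                                    int T, int S, int A)
{
    extern __shared__ float acc[];                   // per-character sums
    int row = blockIdx.x, seq = row / T, s = threadIdx.x;
    float*       pg_row = pg + (size_t)row * A;
    const float* ab_row = ab + (size_t)row * S;
    const int*   l      = label + (size_t)seq * S;

    if (s < S) acc[l[s]] = 0.0f;                     // zero only labeled slots
    __syncthreads();
    if (s < S) atomicAdd(&acc[l[s]], ab_row[s]);     // key-value reduction
    __syncthreads();

    float pa = (s < S) ? pg_row[l[s]] : 0.0f;        // read p before any write
    __syncthreads();
    if (s < S)                                       // duplicate keys write the
        pg_row[l[s]] = pa - acc[l[s]] / (pa * nll[seq]);  // same value: benign
}
```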
GRADIENT MATRIX COMPUTATION Performance on Titan X – Small Alphabet Size

T=150, L=40, A=28:

N (batch) | warp-ctc | optimized | speedup
N=1       | 2.16ms   | 0.02ms    | 134.89x
N=16      | 2.19ms   | 0.06ms    | 37.26x
N=32      | 2.20ms   | 0.11ms    | 19.32x
N=64      | 2.23ms   | 0.21ms    | 10.49x
N=128     | 2.24ms   | 0.41ms    | 5.52x

For warp-ctc, this is the run time of the kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel
GRADIENT MATRIX COMPUTATION Performance on Titan X – Large Alphabet Size

T=150, L=20, A=5000:

N (batch) | warp-ctc | optimized | speedup
N=1       | 5.52ms   | 0.04ms    | 128.26x
N=16      | 6.36ms   | 0.21ms    | 30.28x
N=32      | 6.49ms   | 0.47ms    | 13.73x
N=64      | 6.75ms   | 0.78ms    | 8.67x
N=128     | 7.20ms   | 1.56ms    | 4.63x

For warp-ctc, this is the run time of the kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel
OVERALL PERFORMANCE
OVERALL PERFORMANCE CTC (Alpha+Beta+Gradient) on Titan X – Small Alphabet Size

T=150, L=40, A=28:

N (batch) | warp-ctc | optimized | speedup
N=1       | 2.98ms   | 0.45ms    | 6.57x
N=16      | 3.03ms   | 0.51ms    | 5.92x
N=32      | 3.05ms   | 0.58ms    | 5.25x
N=64      | 3.10ms   | 0.72ms    | 4.27x
N=128     | 3.18ms   | 1.01ms    | 3.14x
OVERALL PERFORMANCE CTC (Alpha+Beta+Gradient) on Titan X – Large Alphabet Size

T=150, L=20, A=5000:

N (batch) | warp-ctc | optimized | speedup
N=1       | 6.34ms   | 0.54ms    | 11.67x
N=16      | 7.30ms   | 0.77ms    | 9.43x
N=32      | 7.43ms   | 1.04ms    | 7.14x
N=64      | 7.71ms   | 1.36ms    | 5.67x
N=128     | 8.20ms   | 2.15ms    | 3.81x
OVERALL PERFORMANCE Softmax+CTC on Titan X – Small Alphabet Size

T=150, L=40, A=28:

N (batch) | warp-ctc | optimized | speedup
N=1       | 3.12ms   | 0.59ms    | 5.28x
N=16      | 3.16ms   | 0.65ms    | 4.89x
N=32      | 3.20ms   | 0.88ms    | 3.65x
N=64      | 3.30ms   | 1.08ms    | 3.07x
N=128     | 3.49ms   | 1.37ms    | 2.56x
OVERALL PERFORMANCE Softmax+CTC on Titan X – Large Alphabet Size

T=150, L=20, A=5000:

N (batch) | warp-ctc | optimized | speedup
N=1       | 6.61ms   | 0.79ms    | 8.34x
N=16      | 9.13ms   | 2.69ms    | 3.40x
N=32      | 11.01ms  | 4.92ms    | 2.24x
N=64      | 14.83ms  | 8.67ms    | 1.71x
N=128     | 22.36ms  | 16.49ms   | 1.36x
April 4-7, 2016 | Silicon Valley
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join