
Page 1:

April 4-7, 2016 | Silicon Valley

Minmin Sun, NVIDIA

[email protected]

April 5th

HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU

Page 2:

2

AGENDA

Brief Introduction of CTC

Alpha/Beta Matrix Computation

Gradient Matrix Computation

Overall Performance

Page 3:

3

BRIEF INTRODUCTION OF CTC

Page 4:

4

BRIEF INTRODUCTION OF CTC Overview

CTC is a loss function used to train the RNN

Inputs: (1) p, the softmax output; (2) the label sequence

Output: g, the gradient w.r.t. the output layer

CTC includes: (1) alpha computation, (2) beta computation, (3) gradient computation

[Figure: an RNN unrolled over frames t-1, t, and t+1. Each hidden state h[t] feeds an output layer y[t] followed by a softmax producing p[t]; CTC consumes p[t] together with the label sequence 'C', 'A', 'T' and returns the gradient g[t].]
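To make these inputs and outputs concrete, here is a minimal interface sketch; the names, types, and layout are illustrative assumptions, not warp-ctc's or this talk's actual API:

    // Hypothetical host-side entry point -- names and layout are assumptions.
    // probs:  minibatch x T x A softmax outputs p (row-major)
    // labels: minibatch label sequences, L characters each
    // grads:  output buffer for g, same shape and layout as probs
    // nll:    per-utterance negative log likelihood, written on return
    void ctc(const float* probs, const int* labels,
             int minibatch, int T, int L, int A,
             float* grads, float* nll);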

Page 5:

5

BRIEF INTRODUCTION OF CTC Alpha/Beta Matrix Computation

Matrix dim: T rows × S columns

S = 2L + 1 is the length of the augmented label sequence l

L is the number of characters in the original label sequence

T is the number of time steps in the utterance

$$\alpha_t(s) = \begin{cases} \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\bigr)\, p_t(l_s) & \text{if } l_s = \text{blank} \text{ or } l_s = l_{s-2} \\ \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\, p_t(l_s) & \text{otherwise} \end{cases}$$
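As a point of reference, a sequential sketch of this recursion follows (names are assumed; a real implementation rescales each row or works in log space to avoid underflow, which is omitted here):

    // Sequential reference for the alpha recursion (illustrative only).
    // alpha and p are row-major: alpha[t*S + s], p[t*A + a]; l is the
    // augmented label sequence of length S = 2L+1 with blanks interleaved.
    void alpha_reference(const float* p, const int* l,
                         int T, int S, int A, int blank, float* alpha)
    {
        for (int s = 0; s < S; ++s)               // t = 0: paths may start at
            alpha[s] = (s < 2) ? p[l[s]] : 0.f;   // the first blank or label
        for (int t = 1; t < T; ++t)
            for (int s = 0; s < S; ++s) {
                float a = alpha[(t - 1) * S + s];
                if (s >= 1) a += alpha[(t - 1) * S + s - 1];
                if (s >= 2 && l[s] != blank && l[s] != l[s - 2])
                    a += alpha[(t - 1) * S + s - 2];
                alpha[t * S + s] = a * p[t * A + l[s]];
            }
    }

The beta recursion is symmetric, running from t = T-1 down to 0.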

Page 6:

6

BRIEF INTRODUCTION OF CTC Alpha/Beta Matrix Computation

[Figure: the alpha matrix α, one column per entry of the augmented label sequence l = blank, c, blank, a, blank, t, blank. Cell α_t(s) in row t depends on α_{t-1}(s-2), α_{t-1}(s-1), and α_{t-1}(s) in row t-1.]

Page 7:

7

BRIEF INTRODUCTION OF CTC Gradient Matrix Computation

Matrix dim: T rows × A columns

A is the alphabet size, e.g. 28 for English

Key-value reduction using the character l(s) as the key

$$g_t(a) = p_t(a) - \frac{1}{p_t(a)\cdot \mathit{nll}} \sum_{s:\, l_s = a} \alpha_t(s)\,\beta_t(s)$$
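For example, with the label 'CAT', the key a = 'A' matches exactly one position s of the augmented sequence, while a = blank matches four positions, so the blank column of g accumulates four α_t(s)·β_t(s) products at each time step.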

Page 8:

8

BRIEF INTRODUCTION OF CTC Gradient Matrix Computation

[Figure: at time step t, the products α_t(s)·β_t(s) along the augmented label sequence l = blank, C, blank, A, blank, T, blank are reduced per character into the gradient row g_t over the alphabet blank, A, B, C, …, Z, space.]

Page 9:

9

ALPHA/BETA MATRIX COMPUTATION

Page 10:

10

ALPHA/BETA MATRIX COMPUTATION GPU Implementation

Each CUDA block owns one sequence, i.e. #Blocks equals the minibatch size

Each thread owns one column of the alpha/beta matrix

Threads iterate over the matrix rows, synchronizing after each iteration (see the sketch below)


[Figure: threads s-2, s-1, and s of one block hold α_{t-1}(s-2), α_{t-1}(s-1), and α_{t-1}(s); thread s combines them into α_t(s).]
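A minimal CUDA sketch of this mapping (illustrative names, no rescaling, blockDim.x >= S assumed; this is not warp-ctc's actual kernel):

    // One block per sequence, one thread per column of alpha.
    __global__ void alpha_kernel(const float* p, const int* l,
                                 int T, int S, int A, int blank, float* alpha)
    {
        int s = threadIdx.x;
        p     += blockIdx.x * T * A;          // this block's sequence
        l     += blockIdx.x * S;
        alpha += blockIdx.x * T * S;
        if (s < S) alpha[s] = (s < 2) ? p[l[s]] : 0.f;   // row t = 0
        for (int t = 1; t < T; ++t) {
            __syncthreads();                  // row t-1 is now complete
            if (s < S) {
                float a = alpha[(t - 1) * S + s];
                if (s >= 1) a += alpha[(t - 1) * S + s - 1];
                if (s >= 2 && l[s] != blank && l[s] != l[s - 2])
                    a += alpha[(t - 1) * S + s - 2];
                alpha[t * S + s] = a * p[t * A + l[s]];
            }
        }
    }

A launch such as alpha_kernel<<<minibatch, threads>>>(...) then gives #Blocks = minibatch size, as above.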

Page 11:

11

ALPHA/BETA MATRIX COMPUTATION Data Reuse

l(s) and l(s-2) are used by every iteration

They are invariant across all iterations

So load them into the register file once, to be reused by all iterations of the thread

$$\alpha_t(s) = \begin{cases} \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\bigr)\, p_t(\boldsymbol{l_s}) & \text{if } \boldsymbol{l_s} = \text{blank} \text{ or } \boldsymbol{l_s} = \boldsymbol{l_{s-2}} \\ \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\, p_t(\boldsymbol{l_s}) & \text{otherwise} \end{cases}$$

Page 12:

12

ALPHA/BETA MATRIX COMPUTATION Data Reuse

α_{t-1}(s) is the output of the previous iteration of the same thread

So it can be passed through the register file

$$\alpha_t(s) = \begin{cases} \bigl(\boldsymbol{\alpha_{t-1}(s)} + \alpha_{t-1}(s-1)\bigr)\, p_t(l_s) & \text{if } l_s = \text{blank} \text{ or } l_s = l_{s-2} \\ \bigl(\boldsymbol{\alpha_{t-1}(s)} + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\, p_t(l_s) & \text{otherwise} \end{cases}$$

Page 13:

13

ALPHA/BETA MATRIX COMPUTATION Data Reuse

α_{t-1}(s-1) and α_{t-1}(s-2) are outputs of the previous iteration of other threads in the same block

So they can be passed through shared memory; a combined sketch follows the formula

$$\alpha_t(s) = \begin{cases} \bigl(\alpha_{t-1}(s) + \boldsymbol{\alpha_{t-1}(s-1)}\bigr)\, p_t(l_s) & \text{if } l_s = \text{blank} \text{ or } l_s = l_{s-2} \\ \bigl(\alpha_{t-1}(s) + \boldsymbol{\alpha_{t-1}(s-1)} + \boldsymbol{\alpha_{t-1}(s-2)}\bigr)\, p_t(l_s) & \text{otherwise} \end{cases}$$
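Combining the three reuse steps, the kernel might be restructured as follows (again an illustrative sketch, blockDim.x >= S assumed: labels and α_{t-1}(s) live in registers, and only the neighbor values α_{t-1}(s-1), α_{t-1}(s-2) travel through shared memory):

    // Alpha kernel with data reuse (sketch). Shared memory: S floats.
    __global__ void alpha_kernel_reuse(const float* p, const int* l,
                                       int T, int S, int A, int blank,
                                       float* alpha)
    {
        extern __shared__ float row[];        // row t-1 of alpha
        int s = threadIdx.x;
        p     += blockIdx.x * T * A;
        l     += blockIdx.x * S;
        alpha += blockIdx.x * T * S;
        // (1) l(s), l(s-2) loaded once into registers
        int  label_s  = (s < S) ? l[s] : blank;
        int  label_s2 = (s >= 2 && s < S) ? l[s - 2] : blank;
        bool three    = (s >= 2) && label_s != blank && label_s != label_s2;
        // (2) alpha(t-1, s) carried in a register across iterations
        float a_prev = (s < 2) ? p[label_s] : 0.f;     // row t = 0
        if (s < S) { row[s] = a_prev; alpha[s] = a_prev; }
        for (int t = 1; t < T; ++t) {
            __syncthreads();                  // row[] now holds row t-1
            // (3) neighbor values read from shared memory, not global
            float a = a_prev;
            if (s >= 1 && s < S) a += row[s - 1];
            if (three)           a += row[s - 2];
            a_prev = a * p[t * A + label_s];
            __syncthreads();                  // reads finished before overwrite
            if (s < S) { row[s] = a_prev; alpha[t * S + s] = a_prev; }
        }
    }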

Page 14:

14

ALPHA/BETA MATRIX COMPUTATION

T=150, L=40, A=28 | warp-ctc | optimized | speedup
N=1               |  0.41 ms |   0.22 ms |   1.89x
N=16              |  0.42 ms |   0.23 ms |   1.84x
N=32              |  0.42 ms |   0.23 ms |   1.82x
N=64              |  0.43 ms |   0.26 ms |   1.70x
N=128             |  0.47 ms |   0.30 ms |   1.56x

Warp-ctc: https://github.com/baidu-research/warp-ctc

Performance on Titan X – Small Alphabet Size

Page 15:

15

ALPHA/BETA MATRIX COMPUTATION

T=150, L=20, A=5000 | warp-ctc | optimized | speedup
N=1                 |  0.41 ms |   0.25 ms |   1.65x
N=16                |  0.47 ms |   0.28 ms |   1.66x
N=32                |  0.47 ms |   0.28 ms |   1.65x
N=64                |  0.48 ms |   0.29 ms |   1.65x
N=128               |  0.50 ms |   0.30 ms |   1.68x

Warp-ctc: https://github.com/baidu-research/warp-ctc

Performance on Titan X – Large Alphabet Size

Page 16:

16

GRADIENT MATRIX COMPUTATION

Page 17:

17

GRADIENT MATRIX COMPUTATION GPU Implementation

Each block owns one row of the alpha and beta matrices, i.e. #Blocks = minibatch × T

Within each block, a key-value reduction through atomic operations on shared memory (sketched below)


[Figure: block t reduces its row of α∗β products into per-character accumulators in the shared memory of block t, producing the gradient row g over blank, A, B, C, …, Z, space.]
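A sketch of this kernel (illustrative names and simplified bookkeeping; nll[n] stands in for the per-sequence likelihood term in the gradient formula above):

    // One block per (sequence, time step) row; grid size = minibatch * T.
    // Shared memory at launch: A * sizeof(float).
    __global__ void grad_kernel(const float* p, const float* alpha,
                                const float* beta, const int* l,
                                const float* nll,   // one value per sequence
                                int T, int S, int A, float* g)
    {
        extern __shared__ float acc[];        // one accumulator per character
        int n = blockIdx.x / T;               // sequence within the minibatch
        int t = blockIdx.x % T;               // time step (matrix row)
        p     += (size_t)(n * T + t) * A;
        g     += (size_t)(n * T + t) * A;
        alpha += (size_t)(n * T + t) * S;
        beta  += (size_t)(n * T + t) * S;
        l     += (size_t)n * S;
        for (int a = threadIdx.x; a < A; a += blockDim.x) acc[a] = 0.f;
        __syncthreads();
        // key-value reduction: key = character l(s), value = alpha * beta
        for (int s = threadIdx.x; s < S; s += blockDim.x)
            atomicAdd(&acc[l[s]], alpha[s] * beta[s]);
        __syncthreads();
        for (int a = threadIdx.x; a < A; a += blockDim.x)
            g[a] = p[a] - acc[a] / (p[a] * nll[n]);
    }

Positions that share a character collide on the same atomic address; the next two slides address those conflicts.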

Page 18:

18

GRADIENT MATRIX COMPUTATION Compute for Blanks Separately

Blanks contribute most of the address conflicts

We know their exact positions in the augmented label sequence

Computing the blanks separately turns their part into an ordinary parallel reduction, as sketched below


[Figure: block t splits its row of α∗β products over l = blank, C, blank, A, blank, T, blank: the blank positions are summed through one shared-memory buffer as a plain reduction, while the remaining characters go through another.]
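One way to realize this split (a sketch under the same assumptions as grad_kernel above; blockDim.x is assumed to be a power of two):

    // Variant of grad_kernel: blanks, which sit at the even positions
    // s = 0, 2, ..., 2L, are summed by a plain tree reduction; only the
    // odd (non-blank) positions still go through atomicAdd.
    // Shared memory at launch: (A + blockDim.x) * sizeof(float).
    __global__ void grad_kernel_blank_split(const float* p, const float* alpha,
                                            const float* beta, const int* l,
                                            const float* nll, int T, int S,
                                            int A, int blank, float* g)
    {
        extern __shared__ float sm[];
        float* acc     = sm;                  // A per-character accumulators
        float* scratch = sm + A;              // blockDim.x reduction scratch
        int n = blockIdx.x / T, t = blockIdx.x % T;
        p     += (size_t)(n * T + t) * A;  g    += (size_t)(n * T + t) * A;
        alpha += (size_t)(n * T + t) * S;  beta += (size_t)(n * T + t) * S;
        l     += (size_t)n * S;
        for (int a = threadIdx.x; a < A; a += blockDim.x) acc[a] = 0.f;
        float v = 0.f;                        // blank contributions (even s)
        for (int s = 2 * threadIdx.x; s < S; s += 2 * blockDim.x)
            v += alpha[s] * beta[s];
        scratch[threadIdx.x] = v;
        __syncthreads();
        for (int k = blockDim.x / 2; k > 0; k >>= 1) {   // tree reduction
            if (threadIdx.x < k)
                scratch[threadIdx.x] += scratch[threadIdx.x + k];
            __syncthreads();
        }
        if (threadIdx.x == 0) acc[blank] = scratch[0];   // no conflicts at all
        for (int s = 2 * threadIdx.x + 1; s < S; s += 2 * blockDim.x)
            atomicAdd(&acc[l[s]], alpha[s] * beta[s]);   // non-blanks only
        __syncthreads();
        for (int a = threadIdx.x; a < A; a += blockDim.x)
            g[a] = p[a] - acc[a] / (p[a] * nll[n]);
    }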

Page 19:

19

GRADIENT MATRIX COMPUTATION Allocate Redundant Shared Memory

It reduces address conflicts for atomic operations

Partial results in the redundant shared-memory copies are then accumulated for each character in parallel

Not applicable to languages with a large alphabet, such as Chinese; a sketch follows

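A sketch of this idea (illustrative; R is an assumed redundancy factor, and R × A floats of shared memory must fit, which is why large alphabets rule it out):

    // grad_kernel with R redundant accumulator copies: thread i adds into
    // copy i % R, so atomics on a hot character spread over R addresses.
    // Shared memory at launch: R * A * sizeof(float).
    __global__ void grad_kernel_redundant(const float* p, const float* alpha,
                                          const float* beta, const int* l,
                                          const float* nll, int T, int S,
                                          int A, float* g)
    {
        const int R = 4;                      // illustrative redundancy factor
        extern __shared__ float acc[];        // R copies of A accumulators
        int n = blockIdx.x / T, t = blockIdx.x % T;
        p     += (size_t)(n * T + t) * A;  g    += (size_t)(n * T + t) * A;
        alpha += (size_t)(n * T + t) * S;  beta += (size_t)(n * T + t) * S;
        l     += (size_t)n * S;
        for (int i = threadIdx.x; i < R * A; i += blockDim.x) acc[i] = 0.f;
        __syncthreads();
        for (int s = threadIdx.x; s < S; s += blockDim.x)
            atomicAdd(&acc[(threadIdx.x % R) * A + l[s]], alpha[s] * beta[s]);
        __syncthreads();
        for (int a = threadIdx.x; a < A; a += blockDim.x) {
            float v = 0.f;                    // fold the R copies in parallel
            for (int r = 0; r < R; ++r) v += acc[r * A + a];
            g[a] = p[a] - v / (p[a] * nll[n]);
        }
    }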

Page 20:

20

GRADIENT MATRIX COMPUTATION Reuse the Memory of Matrix p for Gradient Matrix g

The sum in the formula below is 0 for more than 99% of the characters in the large Chinese alphabet

So more than 99% of the elements of matrix g are identical to those of matrix p, and nearly half the time is spent "copying" them from matrix p to matrix g

Matrix p is no longer used after the gradient computation

By reusing the memory of matrix p for the gradient matrix g, we only need to update fewer than 1% of the matrix elements (see the sketch after the formula)

Not necessary for languages with a small alphabet, such as English

$$g_t(a) = p_t(a) - \frac{1}{p_t(a)\cdot \mathit{nll}} \sum_{s:\, l_s = a} \alpha_t(s)\,\beta_t(s)$$
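A sketch of the in-place variant (illustrative, with blockDim.x >= S assumed; p doubles as g, so only the labeled characters of each row are rewritten and the remaining >99% of elements are left untouched):

    // Update p in place as g: wherever the reduction term is zero,
    // g already equals p, so only positions with l(s) = a change.
    __global__ void grad_inplace_kernel(float* p,      // reused as g in place
                                        const float* alpha, const float* beta,
                                        const int* l, const float* nll,
                                        int T, int S, int A)
    {
        extern __shared__ float acc[];        // only labeled slots are touched
        int n = blockIdx.x / T, t = blockIdx.x % T;
        int s = threadIdx.x;
        p     += (size_t)(n * T + t) * A;
        alpha += (size_t)(n * T + t) * S;
        beta  += (size_t)(n * T + t) * S;
        l     += (size_t)n * S;
        if (s < S) acc[l[s]] = 0.f;           // init just the labeled slots
        __syncthreads();
        if (s < S) atomicAdd(&acc[l[s]], alpha[s] * beta[s]);
        __syncthreads();
        float v = 0.f;
        if (s < S) { int a = l[s]; v = p[a] - acc[a] / (p[a] * nll[n]); }
        __syncthreads();                      // all reads of p before writes
        if (s < S) p[l[s]] = v;               // duplicate keys write equal values
    }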

Page 21:

21

GRADIENT MATRIX COMPUTATION

T=150, L=40, A=28 | warp-ctc | optimized | speedup
N=1               |  2.16 ms |   0.02 ms | 134.89x
N=16              |  2.19 ms |   0.06 ms |  37.26x
N=32              |  2.20 ms |   0.11 ms |  19.32x
N=64              |  2.23 ms |   0.21 ms |  10.49x
N=128             |  2.24 ms |   0.41 ms |   5.52x

For warp-ctc, this is the run time of kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel

Performance on Titan X – Small Alphabet Size

Page 22:

22

GRADIENT MATRIX COMPUTATION

T=150, L=20, A=5000 | warp-ctc | optimized | speedup
N=1                 |  5.52 ms |   0.04 ms | 128.26x
N=16                |  6.36 ms |   0.21 ms |  30.28x
N=32                |  6.49 ms |   0.47 ms |  13.73x
N=64                |  6.75 ms |   0.78 ms |   8.67x
N=128               |  7.20 ms |   1.56 ms |   4.63x

For warp-ctc, this is the run time of kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel

Performance on Titan X – Large Alphabet Size

Page 23:

23

OVERALL PERFORMANCE

Page 24:

24

OVERALL PERFORMANCE

T=150, L=40, A=28 | warp-ctc | optimized | speedup
N=1               |  2.98 ms |   0.45 ms |   6.57x
N=16              |  3.03 ms |   0.51 ms |   5.92x
N=32              |  3.05 ms |   0.58 ms |   5.25x
N=64              |  3.10 ms |   0.72 ms |   4.27x
N=128             |  3.18 ms |   1.01 ms |   3.14x

CTC(Alpha+Beta+Gradient) on Titan X – Small Alphabet Size

Page 25:

25

OVERALL PERFORMANCE

T=150, L=20, A=5000 | warp-ctc | optimized | speedup
N=1                 |  6.34 ms |   0.54 ms |  11.67x
N=16                |  7.30 ms |   0.77 ms |   9.43x
N=32                |  7.43 ms |   1.04 ms |   7.14x
N=64                |  7.71 ms |   1.36 ms |   5.67x
N=128               |  8.20 ms |   2.15 ms |   3.81x

CTC(Alpha+Beta+Gradient) on Titan X – Large Alphabet Size

Page 26:

26

OVERALL PERFORMANCE

T=150, L=40, A=28 | warp-ctc | optimized | speedup
N=1               |  3.12 ms |   0.59 ms |   5.28x
N=16              |  3.16 ms |   0.65 ms |   4.89x
N=32              |  3.20 ms |   0.88 ms |   3.65x
N=64              |  3.30 ms |   1.08 ms |   3.07x
N=128             |  3.49 ms |   1.37 ms |   2.56x

Softmax+CTC on Titan X – Small Alphabet Size

Page 27:

27

OVERALL PERFORMANCE

T=150, L=20, A=5000 | warp-ctc | optimized | speedup
N=1                 |  6.61 ms |   0.79 ms |   8.34x
N=16                |  9.13 ms |   2.69 ms |   3.40x
N=32                | 11.01 ms |   4.92 ms |   2.24x
N=64                | 14.83 ms |   8.67 ms |   1.71x
N=128               | 22.36 ms |  16.49 ms |   1.36x

Softmax+CTC on Titan X – Large Alphabet Size

Page 28:

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join