
The USTC-iFlytek System for CHiME-5 Challenge

Jun Du

2018.09.07

@Hyderabad, India


Team

Jun Du (USTC), Tian Gao (USTC), Lei Sun (USTC)

Feng Ma, Yi Fang, Di-Yuan Liu, Qiang Zhang, Xiang Zhang, Hai-Kun Wang, Jia Pan, Jian-Qing Gao (iFlytek)

Chin-Hui Lee (GIT), Jing-Dong Chen (NWPU)

System Overview (I)–(III)

[System architecture diagrams; figures not preserved in the transcript.]

Implementation Platform

• The official Kaldi toolkit
  • Features: MFCC
  • GMM-HMM acoustic model
  • LF-BLSTM HMM acoustic model
  • LF-CNN-TDNN-LSTM HMM acoustic model
  • Model ensemble

• The CNTK toolkit
  • LSTM-based single-channel speech separation models
  • LSTM-based single-channel speech enhancement models

• Self-developed toolkit
  • Beamforming
  • CNN-HMM acoustic models
  • LSTM language models

WPE + Denoising (IVA, LSA)

• Blind background noise reduction as preprocessing
  • Important for making the subsequent separation/beamforming work

• Step 1: Generalized Weighted Prediction Error (GWPE) [1]
• Step 2: Independent Vector Analysis with Back Projection (IVA-BP, N = M = 4) [2]
• Step 3: Multichannel noise reduction using the log-spectral amplitude (LSA) estimator [3]

$$Y_1(t,f) = \big[\,G_1(t,f),\, G_2(t,f),\, \ldots,\, G_N(t,f)\,\big]\; \mathbf{W}_{\mathrm{IVA\text{-}BP}}(f)\; \mathbf{X}_{\mathrm{WPE}}(t,f)$$

[Block diagram: input signals X_{1..M}(t,f) → GWPE → X^{WPE}_{1..M}(t,f) → IVA & back projection → S_{1..N}(t,f) → N parallel multi-channel NR branches → S^{NR}_{1..N}(t,f) → enhanced-signal mixing → Y_1(t,f).]

[1] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE TASLP, vol. 20, no. 10, pp. 2707-2720, 2012.

[2] N. Ono, "Stable and fast update rules for independent vector analysis based on auxiliary function technique," IEEE WASPAA, 2011, pp. 189-192.

[3] I. Cohen, "Multichannel post-filtering in non-stationary noise environments," IEEE TSP, vol. 52, no. 5, pp. 1149-1160, 2004.
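To make the dataflow concrete, here is a shape-level Python sketch of the three-stage chain. The helpers `gwpe_dereverb`, `iva_bp_separate`, and `lsa_gains` are hypothetical placeholders standing in for [1]-[3], not a real library API:

```python
import numpy as np

# Hypothetical stage implementations; real ones would follow [1]-[3].
def gwpe_dereverb(x):
    return x  # placeholder: GWPE-dereverberated STFT

def iva_bp_separate(x, n_sources):
    return x[:n_sources]  # placeholder: separated, back-projected sources

def lsa_gains(s):
    return np.ones_like(s.real)  # placeholder: LSA suppression gains

def preprocess(x_stft):
    """Shape-level sketch of the front-end; x_stft is (M, T, F), M = 4."""
    x_wpe = gwpe_dereverb(x_stft)                 # Step 1: GWPE [1]
    s = iva_bp_separate(x_wpe, n_sources=4)       # Step 2: IVA-BP [2], N = M = 4
    g = np.stack([lsa_gains(s_n) for s_n in s])   # Step 3: multichannel LSA NR [3]
    return np.sum(g * s, axis=0)                  # mix enhanced sources into Y_1
```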


Single-channel speech separation


SS1

• Motivation of SS1
  • Use non-overlapped data selected via the oracle diarization information
  • Problems of the "non-overlapped" data: insufficient, and not pure
  • Purer non-overlapped data is necessary

• SS1 aims at removing interference cleanly, at the cost of potential distortion of the target
• The target data separated by SS1 serves as the new non-overlapped data

• Objective function of SS1: Y denotes the input feature, X the learning target, and IRM the network output (a hedged reconstruction follows below)
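The SS1 objective itself was an image on the original slide and did not survive extraction. A plausible reconstruction, assuming a standard signal-approximation loss consistent with the definitions above (an assumption, not confirmed by the source):

$$\mathcal{L}_{\mathrm{SS1}} = \sum_{t,f} \big( \mathrm{IRM}(t,f)\, Y(t,f) - X(t,f) \big)^2$$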

SS2

• Motivation of SS2
  • SS1 models introduce large distortion of the target speech
  • SS2 aims at better speech preservation for the ASR task

• Target training data of SS2
  • The target data separated by SS1
  • More target data is available for SS2 training than for SS1 training

• Objective function of SS2

Setup of SS1 and SS2

• Neural network architecture (see the sketch below)
  • 2-layer BLSTM network
  • 512 cells per LSTM layer

• Input features
  • Log-power spectral features
  • Frame length: 32 ms
  • Frame shift: 16 ms

• Training data for each speaker-dependent model
  • Interfering speakers: the other 3 speakers
  • Simulated mixture data size: about 50 hours
  • Input SNRs: -5 dB, 0 dB, 5 dB, 10 dB
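A minimal PyTorch sketch of this mask estimator. It assumes 16 kHz audio (a 32 ms frame then yields 257 spectral bins) and reads "512 cells per LSTM layer" as 512 per direction; the class name is illustrative:

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """2-layer BLSTM mapping log-power spectra to an IRM, as in SS1/SS2."""
    def __init__(self, n_bins=257):  # 32 ms frames at 16 kHz -> 257 bins
        super().__init__()
        self.blstm = nn.LSTM(n_bins, 512, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.mask = nn.Linear(2 * 512, n_bins)

    def forward(self, logspec):                 # (batch, frames, n_bins)
        h, _ = self.blstm(logspec)
        return torch.sigmoid(self.mask(h))      # IRM estimate in [0, 1]
```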

Single-channel speech enhancement

• Densely connected progressive learning for LSTM [1] (a rough sketch follows the reference below)
• Target data (the "clean" data)
  • 40 hours of data preprocessed by WPE + denoising

• Noise data
  • Unlabeled segments of channel 1 in the training sets, filtered using the ASR model

• Input SNRs of the training data: -5 dB, 0 dB, 5 dB
• Simulated training data size: 120 hours
• Architecture: the best configuration in [1]
• Objective function at the output layer of each progressive-learning stage
• Testing stage: channel 1 as the input

[1] T. Gao, J. Du, L.-R. Dai and C.-H. Lee, “Densely connected progressive learning for LSTM-based speech enhancement,” ICASSP 2018.
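A hedged sketch of the densely connected progressive-learning idea in [1]: each stage estimates an intermediate target of progressively higher SNR, and dense connections feed the raw input plus all earlier estimates into later stages. Stage count and layer sizes here are placeholders, not the configuration from [1]:

```python
import torch
import torch.nn as nn

class ProgressiveLSTM(nn.Module):
    """Each stage sees the input concatenated with all earlier estimates
    (dense connections) and predicts a less-noisy intermediate target."""
    def __init__(self, n_bins=257, hidden=512, stages=3):
        super().__init__()
        self.stages, self.heads = nn.ModuleList(), nn.ModuleList()
        in_dim = n_bins
        for _ in range(stages):
            self.stages.append(nn.LSTM(in_dim, hidden, batch_first=True))
            self.heads.append(nn.Linear(hidden, n_bins))
            in_dim += n_bins  # dense connection: later stages also see this output

    def forward(self, x):                  # (batch, frames, n_bins) log-power input
        outputs, feats = [], x
        for lstm, head in zip(self.stages, self.heads):
            h, _ = lstm(feats)
            y = head(h)                    # intermediate (higher-SNR) estimate
            outputs.append(y)
            feats = torch.cat([feats, y], dim=-1)
        return outputs  # at training time, each output gets its own MSE loss
```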

Beamforming

• DL-CGMM-GEVD source separation (a runnable GEVD sketch follows the references below)
  • The posterior probability of each TF bin belonging to source v is estimated
  • Three sources: target speech, interfering speech, background noise
  • The mask outputs of the SS/SE deep models initialize the CGMM parameters
  • Extends the CGMM in [1] from 2 Gaussian mixtures to 3
  • Well addresses the source-order permutation problem

• Two-pass array selection for the multiple-array track
  • Select 3 arrays by SNR for SS model training (1st pass) and by SINR for the ASR task (2nd pass)
  • Fuse the recognition results of the 3 time-aligned arrays via acoustic-model ensemble

[Block diagram: masks from the SS and SE models initialize a CGMM mask estimator [1] for the three sources (target T, interference I, noise N); the estimated masks drive GEVD beamforming [2]; SNR/SINR-based array selection feeds the ASR back-end.]

[1] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise," in ICASSP, 2016.

[2] E. Warsitz and R. Haeb-Umbach, "Blind acoustic beamforming based on generalized eigenvalue decomposition," IEEE TASLP, vol. 15, no. 5, pp. 1529-1539, 2007.
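To ground the GEVD step, a runnable numpy/scipy sketch of mask-driven GEVD beamforming in the spirit of [2]; the function name and array shapes are mine, and the CGMM mask estimation and array selection are omitted:

```python
import numpy as np
from scipy.linalg import eigh

def gevd_beamform(stft, mask_speech, mask_noise, eps=1e-6):
    """stft: (F, T, M) complex multichannel STFT;
    mask_speech, mask_noise: (F, T) masks from the CGMM estimator."""
    n_freq, n_frames, n_mics = stft.shape
    out = np.zeros((n_freq, n_frames), dtype=complex)
    for f in range(n_freq):
        X = stft[f]                                   # (T, M)
        outer = X[:, :, None] * X[:, None, :].conj()  # (T, M, M) rank-1 terms
        phi_s = (mask_speech[f][:, None, None] * outer).mean(axis=0)
        phi_n = (mask_noise[f][:, None, None] * outer).mean(axis=0)
        phi_n += eps * np.eye(n_mics)                 # regularize the noise matrix
        # principal generalized eigenvector of (phi_s, phi_n) = max-SNR filter
        _, vecs = eigh(phi_s, phi_n)
        w = vecs[:, -1]                               # largest eigenvalue is last
        out[f] = X @ w.conj()                         # y(t, f) = w^H x(t, f)
    return out
```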

Speech Demo

• Original, channel 1
• WPE, IVA, LSA (dereverberation and denoising as preprocessing)
• Official beamforming (the interfering male speaker is still present)
• Single-channel SS1 (good suppression of interference, large distortion of the target)
• Single-channel SS2 (weaker suppression of interference, smaller distortion of the target)
• Beamforming with SS2 (the best trade-off)

Front-end (Official vs. Ours)

WER (%) on the development set (single-array, Rank A) using the official baseline LF-TDNN system:

System                   KITCHEN   DINING   LIVING    ALL
Official                   85.03    79.72    78.39   81.14
Ours (Single-channel)      80.23    76.31    71.16   75.72
Ours (Multi-channel)       72.27    67.71    63.33   67.66

Relative WER reduction of Ours (Multi-channel) over Official: +15.0% (KITCHEN), +15.1% (DINING), +19.2% (LIVING), +16.6% (ALL).

Acoustic Data Augmentation

• Worn data
  • Left channel + right channel
  • Data cleanup (as used in the baseline system)
  • Data size: 64 hours

• Far-field data
  • Preprocessed data (WPE+IVA+LSA) of all arrays
  • Data cleanup
  • Data size: 110 hours + 110 hours (after front-end)

• Simulated far-field data (a hedged simulation sketch follows below)
  • 1000+ RIRs computed from the paired worn and far-field recordings
  • The RIRs and noise segments are used to simulate far-field data from the worn data
  • Data size: 250 hours

• Total training data: 534 hours
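A small sketch of the simulation step; the policy of scaling noise to a target SNR after RIR convolution is my assumption, as the slide only names the ingredients:

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(worn, rir, noise, snr_db):
    """Convolve a worn-mic utterance with an estimated RIR, then add a
    noise segment scaled so the mixture reaches the requested SNR."""
    reverberant = fftconvolve(worn, rir)[: len(worn)]
    noise = noise[: len(reverberant)]
    p_speech = np.mean(reverberant ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return reverberant + scale * noise
```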


Lattice-Free MMI [1] Based AMs

• LF-BLSTM
  • 5-layer BLSTM network
  • 40-d MFCC
  • 100-d i-vector

• LF-CNN-TDNN-LSTM
  • 2-layer CNN + 9-layer TDNN + 3-layer LSTM network
  • 40-d MFCC
  • 100-d i-vector

[1] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI", in Proc. Interspeech, 2016, pp.2751-2755.

Cross-Entropy Based AMs

• CNN1: CLDNN
  • Input 1: 40-d LMFB
  • Input 2: waveform

• CNN2: 50-layer deep fully-convolutional network
  • Input 1: 40-d LMFB
  • Input 2: waveform

• CNN3: 50-layer deep fully-convolutional network with gating on the feature maps
  • Input 1: 40-d LMFB
  • Input 2: waveform

CLDNN

[Architecture diagram: a waveform branch (conv → ReLU → pow → pool → log → reshape; "1, T*160" input mapped to a "128, T" output) and a 40-d filterbank branch are concatenated, followed by a stack of conv + BN + ReLU (+ pool) blocks and a deconv layer, then three BLSTM layers each followed by FC + BN + ReLU, and a final FC layer (3936 outputs) into the loss. Layer shapes are annotated as channel_out, kernel_h, kernel_w, stride_h, stride_w.]
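For orientation, a loose PyTorch sketch of this two-branch CLDNN. It keeps the learned waveform front-end (conv → ReLU → pow → log), the concatenation with 40-d filterbanks, and the BLSTM/FC interleaving from the diagram, but simplifies away the conv stack; the 16 kHz / 10 ms framing and all layer sizes other than those read off the diagram are assumptions:

```python
import torch
import torch.nn as nn

class FCBNReLU(nn.Module):
    """FC -> BN -> ReLU applied frame-wise (BN over the feature dim)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.bn = nn.BatchNorm1d(d_out)

    def forward(self, x):                       # (B, T, d_in)
        y = self.fc(x).transpose(1, 2)          # (B, d_out, T) for BatchNorm1d
        return torch.relu(self.bn(y)).transpose(1, 2)

class CLDNNSketch(nn.Module):
    def __init__(self, n_senones=3936, fbank_dim=40):
        super().__init__()
        # 128 learned filters with 1025-tap kernels (per the diagram);
        # stride 160 = 10 ms at 16 kHz so the branch matches the frame rate
        self.wave_conv = nn.Conv1d(1, 128, kernel_size=1025, stride=160,
                                   padding=512)
        self.blstms, self.fcs = nn.ModuleList(), nn.ModuleList()
        d = 128 + fbank_dim                     # concat of the two branches
        for _ in range(3):
            self.blstms.append(nn.LSTM(d, 256, bidirectional=True,
                                       batch_first=True))
            self.fcs.append(FCBNReLU(512, 512))
            d = 512
        self.out = nn.Linear(512, n_senones)

    def forward(self, wav, fbank):      # wav: (B, 1, T*160), fbank: (B, T, 40)
        w = self.wave_conv(wav)                         # conv
        w = torch.log(torch.relu(w).pow(2) + 1e-6)      # relu -> pow -> log
        w = w[..., : fbank.size(1)].transpose(1, 2)     # align to T frames
        x = torch.cat([w, fbank], dim=-1)
        for blstm, fc in zip(self.blstms, self.fcs):
            x, _ = blstm(x)                             # (B, T, 512)
            x = fc(x)
        return self.out(x)              # (B, T, n_senones) senone logits
```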

AMs with Our Best Front-end

• Ensemble via state-posterior averaging and lattice combination (a minimal averaging sketch follows below)
• 5-model ensemble (LF-BLSTM, LF-CNN-TDNN-LSTM, CNN-Ensemble)

WER (%) on the development set (single-array, Rank A, no data augmentation):

System              KITCHEN   DINING   LIVING    ALL
Official-AM           72.27    67.71    63.33   67.66
LF-BLSTM              64.00    60.56    54.51   59.44
LF-CNN-TDNN-LSTM      61.23    58.65    52.39   57.13
CNN-Ensemble          57.38    54.42    47.14   52.64
Ensemble              55.55    52.44    44.98   50.64

Relative WER reductions noted on the slide: +15.3%, +23.1%, +25.2%, +29.0%.
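State-posterior averaging itself is simple; a minimal sketch (uniform weights are my assumption, and lattice combination is not shown):

```python
import numpy as np

def average_state_posteriors(posteriors, weights=None):
    """Average per-frame senone posteriors from several acoustic models
    before decoding; `posteriors` is a list of (frames, senones) arrays."""
    if weights is None:
        weights = [1.0 / len(posteriors)] * len(posteriors)
    return sum(w * p for w, p in zip(weights, posteriors))
```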

Language Model

LSTM-LM (forward) and BLSTM-LM (forward-backward) are combined.

[Diagram: LSTMP language model — one-hot input (12810) → embedding (256) → LSTM (512) with a recurrent projection (256) → output (12810).]
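The diagram maps directly onto a small PyTorch model; `proj_size` implements the LSTMP recurrent projection. The vocabulary size (12810) and the 256/512/256 dimensions come from the diagram; everything else is illustrative:

```python
import torch
import torch.nn as nn

class LSTMPLanguageModel(nn.Module):
    """One-hot input (12810) -> embedding (256) -> LSTM (512 cells)
    with a 256-d recurrent projection -> 12810-way output."""
    def __init__(self, vocab=12810):
        super().__init__()
        self.embed = nn.Embedding(vocab, 256)
        self.lstm = nn.LSTM(256, 512, proj_size=256, batch_first=True)
        self.out = nn.Linear(256, vocab)

    def forward(self, tokens):          # (batch, seq_len) word ids
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)              # next-word logits
```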

LMs (Rank A vs. Rank B)

WER (%) on the development set (single-array):

System        KITCHEN   DINING   LIVING    ALL
Official-LM     55.55    52.44    44.98   50.64
Ours-LM         55.11    51.74    44.69   50.20

+0.9% relative on ALL; not significant, due to too little LM training data.

LMs (Rank A vs. Rank B)

WER (%) on the development set (multiple-array):

System        KITCHEN   DINING   LIVING    ALL
Official-LM     46.20    47.10    44.25   45.65
Ours-LM         45.49    46.78    43.61   45.05

+1.3% relative on ALL.

Summary of Single-array

WER (%), Rank A:

System                          DEV     EVAL
Official baseline               81.14   73.27
+ our front-end                 67.66   —
+ our front-end and back-end    50.64   46.42

Relative WER reductions over the baseline: +16.6% (DEV, front-end only), +37.6% (DEV, full system), +36.6% (EVAL, full system).

Front-end and back-end are equally important.
Our system is robust on the evaluation set.

Summary of Multiple-array

WER (%), Rank A:

System            DEV     EVAL
Single-Array      50.64   46.42
Multiple-Array    45.65   46.63

+9.9% relative on DEV but -0.5% on EVAL: array selection did not work on the evaluation set.

Summary: Details of Rank-A

[Result table not preserved in the transcript.] Both the single-array and the multiple-array systems are the best.

Summary: Details of Rank-B

[Result table not preserved in the transcript.] Both the single-array and the multiple-array systems are the best.

Take-Home Message

• Front-end
  • The dinner-party scenario is extremely challenging
  • Most techniques that worked in CHiME-4 do not work well here
  • We designed a solution that leverages both traditional and DL techniques

• Acoustic model
  • Data augmentation is important, with contributions from the front-end
  • The newly designed CLDNN achieves the best performance
  • Different deep architectures are complementary

• Language model
  • LSTM-LMs are not effective due to the limited training data

• Multiple-array
  • Our proposed array selection is effective on the development set
  • More analysis is needed on the evaluation set

Acknowledgement

• JSALT 2017
  • Team: Enhancement and Analysis of Conversational Speech

• DIHARD Challenge (Interspeech 2018 Special Session)
  • Inspiration for the front-end design in the CHiME-5 Challenge

Thanks / Q&A
