
The USTC-iFlytek System for CHiME-5 Challenge

Jun Du

2018.09.07

@Hyderabad, India


Team

Jun Du (USTC), Tian Gao (USTC), Lei Sun (USTC)

Feng Ma, Yi Fang, Di-Yuan Liu, Qiang Zhang, Xiang Zhang, Hai-Kun Wang, Jia Pan, Jian-Qing Gao (iFlytek)

Chin-Hui Lee (GIT), Jing-Dong Chen (NWPU)

System Overview (I)–(III)

[System architecture diagrams; figures not preserved in the transcript.]

Implementation Platform

• The official Kaldi toolkit
  • Features: MFCC
  • GMM-HMM acoustic model
  • LF-BLSTM HMM acoustic model
  • LF-CNN-TDNN-LSTM HMM acoustic model
  • Model ensemble

• The CNTK toolkit
  • LSTM-based single-channel speech separation models
  • LSTM-based single-channel speech enhancement models

• Self-developed toolkit
  • Beamforming
  • CNN-HMM acoustic models
  • LSTM language models

WPE + Denoising (IVA, LSA)

• Blind background noise reduction as preprocessing
  • Important for making the subsequent separation/beamforming work

• Step 1: Generalized Weighted Prediction Error (GWPE) [1]
• Step 2: Independent Vector Analysis with Back Projection (IVA-BP, N = M = 4) [2]
• Step 3: Multichannel noise reduction using the log-spectral amplitude (LSA) estimator [3]

$$Y_1(t,f) = \big[\,G_1(t,f),\, G_2(t,f),\, \ldots,\, G_N(t,f)\,\big]\; \mathbf{W}_{\mathrm{IVA\text{-}BP}}(f)\; \mathbf{X}_{\mathrm{WPE}}(t,f)$$

[Block diagram: input signals X_{1..M}(t,f) → GWPE → X^{WPE}_{1..M}(t,f) → IVA & back projection → S_{1..N}(t,f) → N parallel multi-channel NR branches → S^{NR}_{1..N}(t,f) → enhanced-signal mixing → Y_1(t,f).]

[1] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE TASLP, vol. 20, no. 10, pp. 2707-2720, 2012.

[2] N. Ono, "Stable and fast update rules for independent vector analysis based on auxiliary function technique," IEEE WASPAA, 2011, pp. 189-192.

[3] I. Cohen, "Multichannel post-filtering in non-stationary noise environments," IEEE TSP, vol. 52, no. 5, pp. 1149-1160, 2004.
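To make the dataflow concrete, here is a shape-level Python sketch of the three-stage chain. The helpers `gwpe_dereverb`, `iva_bp_separate`, and `lsa_gains` are hypothetical placeholders standing in for [1]-[3], not a real library API:

```python
import numpy as np

# Hypothetical stage implementations; real ones would follow [1]-[3].
def gwpe_dereverb(x):
    return x  # placeholder: GWPE-dereverberated STFT

def iva_bp_separate(x, n_sources):
    return x[:n_sources]  # placeholder: separated, back-projected sources

def lsa_gains(s):
    return np.ones_like(s.real)  # placeholder: LSA suppression gains

def preprocess(x_stft):
    """Shape-level sketch of the front-end; x_stft is (M, T, F), M = 4."""
    x_wpe = gwpe_dereverb(x_stft)                 # Step 1: GWPE [1]
    s = iva_bp_separate(x_wpe, n_sources=4)       # Step 2: IVA-BP [2], N = M = 4
    g = np.stack([lsa_gains(s_n) for s_n in s])   # Step 3: multichannel LSA NR [3]
    return np.sum(g * s, axis=0)                  # mix enhanced sources into Y_1
```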


Single-channel speech separation


SS1

• Motivation of SS1
  • Use non-overlapped data selected via the oracle diarization information
  • Problems of the "non-overlapped" data: insufficient, and not pure
  • Purer non-overlapped data is necessary

• SS1 aims at removing interference cleanly, at the cost of potential distortion of the target
• The target data separated by SS1 serves as the new non-overlapped data

• Objective function of SS1: Y denotes the input feature, X the learning target, and IRM the network output (a hedged reconstruction follows below)
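The SS1 objective itself was an image on the original slide and did not survive extraction. A plausible reconstruction, assuming a standard signal-approximation loss consistent with the definitions above (an assumption, not confirmed by the source):

$$\mathcal{L}_{\mathrm{SS1}} = \sum_{t,f} \big( \mathrm{IRM}(t,f)\, Y(t,f) - X(t,f) \big)^2$$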

SS2

• Motivation of SS2
  • SS1 models introduce large distortion of the target speech
  • SS2 aims at better speech preservation for the ASR task

• Target training data of SS2
  • The target data separated by SS1
  • More target data is available for SS2 training than for SS1 training

• Objective function of SS2

Setup of SS1 and SS2

• Neural network architecture (see the sketch below)
  • 2-layer BLSTM network
  • 512 cells per LSTM layer

• Input features
  • Log-power spectral features
  • Frame length: 32 ms
  • Frame shift: 16 ms

• Training data for each speaker-dependent model
  • Interfering speakers: the other 3 speakers
  • Simulated mixture data size: about 50 hours
  • Input SNRs: -5 dB, 0 dB, 5 dB, 10 dB
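A minimal PyTorch sketch of this mask estimator. It assumes 16 kHz audio (a 32 ms frame then yields 257 spectral bins) and reads "512 cells per LSTM layer" as 512 per direction; the class name is illustrative:

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """2-layer BLSTM mapping log-power spectra to an IRM, as in SS1/SS2."""
    def __init__(self, n_bins=257):  # 32 ms frames at 16 kHz -> 257 bins
        super().__init__()
        self.blstm = nn.LSTM(n_bins, 512, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.mask = nn.Linear(2 * 512, n_bins)

    def forward(self, logspec):                 # (batch, frames, n_bins)
        h, _ = self.blstm(logspec)
        return torch.sigmoid(self.mask(h))      # IRM estimate in [0, 1]
```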

Single-channel speech enhancement

• Densely connected progressive learning for LSTM [1] (a rough sketch follows the reference below)
• Target data (the "clean" data)
  • 40 hours of data preprocessed by WPE + denoising

• Noise data
  • Unlabeled segments of channel 1 in the training sets, filtered using the ASR model

• Input SNRs of the training data: -5 dB, 0 dB, 5 dB
• Simulated training data size: 120 hours
• Architecture: the best configuration in [1]
• Objective function at the output layer of each progressive-learning stage
• Testing stage: channel 1 as the input

[1] T. Gao, J. Du, L.-R. Dai and C.-H. Lee, “Densely connected progressive learning for LSTM-based speech enhancement,” ICASSP 2018.
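A hedged sketch of the densely connected progressive-learning idea in [1]: each stage estimates an intermediate target of progressively higher SNR, and dense connections feed the raw input plus all earlier estimates into later stages. Stage count and layer sizes here are placeholders, not the configuration from [1]:

```python
import torch
import torch.nn as nn

class ProgressiveLSTM(nn.Module):
    """Each stage sees the input concatenated with all earlier estimates
    (dense connections) and predicts a less-noisy intermediate target."""
    def __init__(self, n_bins=257, hidden=512, stages=3):
        super().__init__()
        self.stages, self.heads = nn.ModuleList(), nn.ModuleList()
        in_dim = n_bins
        for _ in range(stages):
            self.stages.append(nn.LSTM(in_dim, hidden, batch_first=True))
            self.heads.append(nn.Linear(hidden, n_bins))
            in_dim += n_bins  # dense connection: later stages also see this output

    def forward(self, x):                  # (batch, frames, n_bins) log-power input
        outputs, feats = [], x
        for lstm, head in zip(self.stages, self.heads):
            h, _ = lstm(feats)
            y = head(h)                    # intermediate (higher-SNR) estimate
            outputs.append(y)
            feats = torch.cat([feats, y], dim=-1)
        return outputs  # at training time, each output gets its own MSE loss
```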

Beamforming

• DL-CGMM-GEVD source separation (a runnable GEVD sketch follows the references below)
  • The posterior probability of each TF bin belonging to source v is estimated
  • Three sources: target speech, interfering speech, background noise
  • The mask outputs of the SS/SE deep models initialize the CGMM parameters
  • Extends the CGMM in [1] from 2 Gaussian mixtures to 3
  • Well addresses the source-order permutation problem

• Two-pass array selection for the multiple-array track
  • Select 3 arrays by SNR for SS model training (1st pass) and by SINR for the ASR task (2nd pass)
  • Fuse the recognition results of the 3 time-aligned arrays via acoustic-model ensemble

[Block diagram: masks from the SS and SE models initialize a CGMM mask estimator [1] for the three sources (target T, interference I, noise N); the estimated masks drive GEVD beamforming [2]; SNR/SINR-based array selection feeds the ASR back-end.]

[1] T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, "Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise," in ICASSP, 2016.

[2] E. Warsitz and R. Haeb-Umbach, "Blind acoustic beamforming based on generalized eigenvalue decomposition," IEEE TASLP, vol. 15, no. 5, pp. 1529-1539, 2007.
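To ground the GEVD step, a runnable numpy/scipy sketch of mask-driven GEVD beamforming in the spirit of [2]; the function name and array shapes are mine, and the CGMM mask estimation and array selection are omitted:

```python
import numpy as np
from scipy.linalg import eigh

def gevd_beamform(stft, mask_speech, mask_noise, eps=1e-6):
    """stft: (F, T, M) complex multichannel STFT;
    mask_speech, mask_noise: (F, T) masks from the CGMM estimator."""
    n_freq, n_frames, n_mics = stft.shape
    out = np.zeros((n_freq, n_frames), dtype=complex)
    for f in range(n_freq):
        X = stft[f]                                   # (T, M)
        outer = X[:, :, None] * X[:, None, :].conj()  # (T, M, M) rank-1 terms
        phi_s = (mask_speech[f][:, None, None] * outer).mean(axis=0)
        phi_n = (mask_noise[f][:, None, None] * outer).mean(axis=0)
        phi_n += eps * np.eye(n_mics)                 # regularize the noise matrix
        # principal generalized eigenvector of (phi_s, phi_n) = max-SNR filter
        _, vecs = eigh(phi_s, phi_n)
        w = vecs[:, -1]                               # largest eigenvalue is last
        out[f] = X @ w.conj()                         # y(t, f) = w^H x(t, f)
    return out
```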

Speech Demo

• Original, channel 1
• WPE, IVA, LSA (dereverberation and denoising as preprocessing)
• Official beamforming (the interfering male speaker is still present)
• Single-channel SS1 (good suppression of interference, large distortion of the target)
• Single-channel SS2 (weaker suppression of interference, smaller distortion of the target)
• Beamforming with SS2 (the best trade-off)

Front-end (Official vs. Ours)

WER (%) on the development set (single-array, Rank A) using the official baseline LF-TDNN system:

System                   KITCHEN   DINING   LIVING    ALL
Official                   85.03    79.72    78.39   81.14
Ours (Single-channel)      80.23    76.31    71.16   75.72
Ours (Multi-channel)       72.27    67.71    63.33   67.66

Relative WER reduction of Ours (Multi-channel) over Official: +15.0% (KITCHEN), +15.1% (DINING), +19.2% (LIVING), +16.6% (ALL).

Acoustic Data Augmentation

• Worn data
  • Left channel + right channel
  • Data cleanup (as used in the baseline system)
  • Data size: 64 hours

• Far-field data
  • Preprocessed data (WPE+IVA+LSA) of all arrays
  • Data cleanup
  • Data size: 110 hours + 110 hours (after front-end)

• Simulated far-field data (a hedged simulation sketch follows below)
  • 1000+ RIRs computed from the paired worn and far-field recordings
  • The RIRs and noise segments are used to simulate far-field data from the worn data
  • Data size: 250 hours

• Total training data: 534 hours
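A small sketch of the simulation step; the policy of scaling noise to a target SNR after RIR convolution is my assumption, as the slide only names the ingredients:

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(worn, rir, noise, snr_db):
    """Convolve a worn-mic utterance with an estimated RIR, then add a
    noise segment scaled so the mixture reaches the requested SNR."""
    reverberant = fftconvolve(worn, rir)[: len(worn)]
    noise = noise[: len(reverberant)]
    p_speech = np.mean(reverberant ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return reverberant + scale * noise
```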


Lattice-Free MMI [1] Based AMs

• LF-BLSTM
  • 5-layer BLSTM network
  • 40-d MFCC
  • 100-d i-vector

• LF-CNN-TDNN-LSTM
  • 2-layer CNN + 9-layer TDNN + 3-layer LSTM network
  • 40-d MFCC
  • 100-d i-vector

[1] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI", in Proc. Interspeech, 2016, pp.2751-2755.

Cross-Entropy Based AMs

• CNN1: CLDNN
  • Input 1: 40-d LMFB
  • Input 2: waveform

• CNN2: 50-layer deep fully-convolutional network
  • Input 1: 40-d LMFB
  • Input 2: waveform

• CNN3: 50-layer deep fully-convolutional network with gating on the feature maps
  • Input 1: 40-d LMFB
  • Input 2: waveform

CLDNN

[Architecture diagram: a waveform branch (conv → ReLU → pow → pool → log → reshape; "1, T*160" input mapped to a "128, T" output) and a 40-d filterbank branch are concatenated, followed by a stack of conv + BN + ReLU (+ pool) blocks and a deconv layer, then three BLSTM layers each followed by FC + BN + ReLU, and a final FC layer (3936 outputs) into the loss. Layer shapes are annotated as channel_out, kernel_h, kernel_w, stride_h, stride_w.]
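For orientation, a loose PyTorch sketch of this two-branch CLDNN. It keeps the learned waveform front-end (conv → ReLU → pow → log), the concatenation with 40-d filterbanks, and the BLSTM/FC interleaving from the diagram, but simplifies away the conv stack; the 16 kHz / 10 ms framing and all layer sizes other than those read off the diagram are assumptions:

```python
import torch
import torch.nn as nn

class FCBNReLU(nn.Module):
    """FC -> BN -> ReLU applied frame-wise (BN over the feature dim)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.bn = nn.BatchNorm1d(d_out)

    def forward(self, x):                       # (B, T, d_in)
        y = self.fc(x).transpose(1, 2)          # (B, d_out, T) for BatchNorm1d
        return torch.relu(self.bn(y)).transpose(1, 2)

class CLDNNSketch(nn.Module):
    def __init__(self, n_senones=3936, fbank_dim=40):
        super().__init__()
        # 128 learned filters with 1025-tap kernels (per the diagram);
        # stride 160 = 10 ms at 16 kHz so the branch matches the frame rate
        self.wave_conv = nn.Conv1d(1, 128, kernel_size=1025, stride=160,
                                   padding=512)
        self.blstms, self.fcs = nn.ModuleList(), nn.ModuleList()
        d = 128 + fbank_dim                     # concat of the two branches
        for _ in range(3):
            self.blstms.append(nn.LSTM(d, 256, bidirectional=True,
                                       batch_first=True))
            self.fcs.append(FCBNReLU(512, 512))
            d = 512
        self.out = nn.Linear(512, n_senones)

    def forward(self, wav, fbank):      # wav: (B, 1, T*160), fbank: (B, T, 40)
        w = self.wave_conv(wav)                         # conv
        w = torch.log(torch.relu(w).pow(2) + 1e-6)      # relu -> pow -> log
        w = w[..., : fbank.size(1)].transpose(1, 2)     # align to T frames
        x = torch.cat([w, fbank], dim=-1)
        for blstm, fc in zip(self.blstms, self.fcs):
            x, _ = blstm(x)                             # (B, T, 512)
            x = fc(x)
        return self.out(x)              # (B, T, n_senones) senone logits
```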

AMs with Our Best Front-end

• Ensemble via state-posterior averaging and lattice combination (a minimal averaging sketch follows below)
• 5-model ensemble (LF-BLSTM, LF-CNN-TDNN-LSTM, CNN-Ensemble)

WER (%) on the development set (single-array, Rank A, no data augmentation):

System              KITCHEN   DINING   LIVING    ALL
Official-AM           72.27    67.71    63.33   67.66
LF-BLSTM              64.00    60.56    54.51   59.44
LF-CNN-TDNN-LSTM      61.23    58.65    52.39   57.13
CNN-Ensemble          57.38    54.42    47.14   52.64
Ensemble              55.55    52.44    44.98   50.64

Relative WER reductions noted on the slide: +15.3%, +23.1%, +25.2%, +29.0%.
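State-posterior averaging itself is simple; a minimal sketch (uniform weights are my assumption, and lattice combination is not shown):

```python
import numpy as np

def average_state_posteriors(posteriors, weights=None):
    """Average per-frame senone posteriors from several acoustic models
    before decoding; `posteriors` is a list of (frames, senones) arrays."""
    if weights is None:
        weights = [1.0 / len(posteriors)] * len(posteriors)
    return sum(w * p for w, p in zip(weights, posteriors))
```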

Language Model

LSTM-LM (forward) and BLSTM-LM (forward-backward) are combined.

[Diagram: LSTMP language model — one-hot input (12810) → embedding (256) → LSTM (512) with a recurrent projection (256) → output (12810).]
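The diagram maps directly onto a small PyTorch model; `proj_size` implements the LSTMP recurrent projection. The vocabulary size (12810) and the 256/512/256 dimensions come from the diagram; everything else is illustrative:

```python
import torch
import torch.nn as nn

class LSTMPLanguageModel(nn.Module):
    """One-hot input (12810) -> embedding (256) -> LSTM (512 cells)
    with a 256-d recurrent projection -> 12810-way output."""
    def __init__(self, vocab=12810):
        super().__init__()
        self.embed = nn.Embedding(vocab, 256)
        self.lstm = nn.LSTM(256, 512, proj_size=256, batch_first=True)
        self.out = nn.Linear(256, vocab)

    def forward(self, tokens):          # (batch, seq_len) word ids
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)              # next-word logits
```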

LMs (Rank A vs. Rank B)

WER (%) on the development set (single-array):

System        KITCHEN   DINING   LIVING    ALL
Official-LM     55.55    52.44    44.98   50.64
Ours-LM         55.11    51.74    44.69   50.20

+0.9% relative on ALL; not significant, due to too little LM training data.

LMs (Rank A vs. Rank B)

WER (%) on the development set (multiple-array):

System        KITCHEN   DINING   LIVING    ALL
Official-LM     46.20    47.10    44.25   45.65
Ours-LM         45.49    46.78    43.61   45.05

+1.3% relative on ALL.

Summary of Single-array

WER (%), Rank A:

System                          DEV     EVAL
Official baseline               81.14   73.27
+ our front-end                 67.66   —
+ our front-end and back-end    50.64   46.42

Relative WER reductions over the baseline: +16.6% (DEV, front-end only), +37.6% (DEV, full system), +36.6% (EVAL, full system).

Front-end and back-end are equally important.
Our system is robust on the evaluation set.

Summary of Multiple-array

WER (%), Rank A:

System            DEV     EVAL
Single-Array      50.64   46.42
Multiple-Array    45.65   46.63

+9.9% relative on DEV but -0.5% on EVAL: array selection did not work on the evaluation set.

Summary: Details of Rank-A

[Result table not preserved in the transcript.] Both the single-array and the multiple-array systems are the best.

Summary: Details of Rank-B

[Result table not preserved in the transcript.] Both the single-array and the multiple-array systems are the best.

Take-Home Message

• Front-end
  • The dinner-party scenario is extremely challenging
  • Most techniques that worked in CHiME-4 do not work well here
  • We designed a solution that leverages both traditional and DL techniques

• Acoustic model
  • Data augmentation is important, with contributions from the front-end
  • The newly designed CLDNN achieves the best performance
  • Different deep architectures are complementary

• Language model
  • LSTM-LMs are not effective due to the limited training data

• Multiple-array
  • Our proposed array selection is effective on the development set
  • More analysis is needed on the evaluation set

Acknowledgement

• JSALT 2017
  • Team: Enhancement and Analysis of Conversational Speech

• DIHARD Challenge (Interspeech 2018 Special Session)
  • Inspiration for the front-end design in the CHiME-5 Challenge

Thanks / Q&A
