Shinji Watanabe, Mitsubishi Electric Research Laboratories (MERL)
Xiong Xiao, Nanyang Technological University
Marc Delcroix, NTT Communication Science Laboratories


List of abbreviations

AM: Acoustic Model
ASR: Automatic Speech Recognition
BF: Beamformer
BLSTM: Bidirectional LSTM
CE: Cross Entropy
CMLLR: Constrained MLLR (equivalent to fMLLR)
CNN: Convolutional Neural Network
DAE: Denoising Autoencoder
DNN: Deep Neural Network
DOC: Damped Oscillator Coefficients
D&S: Delay-and-Sum (beamformer)
DSR: Distant Speech Recognition
EM: Expectation-Maximization
fDLR: Feature-space Discriminative Linear Regression
fMLLR: Feature-space MLLR (equivalent to CMLLR)
GCC-PHAT: Generalized Cross-Correlation with Phase Transform
GMM: Gaussian Mixture Model
GSC: Generalized Sidelobe Canceller
HMM: Hidden Markov Model
IRM: Ideal Ratio Mask
KL: Kullback–Leibler (divergence/distance)
LCMV: Linearly Constrained Minimum Variance
LDA: Linear Discriminant Analysis
LHN: Linear Hidden Network
LHUC: Learning Hidden Unit Contributions
LIN: Linear Input Network
LM: Language Model
LP: Linear Prediction
LSTM: Long Short-Term Memory (network)
LVCSR: Large-Vocabulary Continuous Speech Recognition
MAP: Maximum A Posteriori
MBR: Minimum Bayes Risk
MCWF: Multi-Channel Wiener Filter
ML: Maximum Likelihood
MLLR: Maximum Likelihood Linear Regression
MLLT: Maximum Likelihood Linear Transformation
MMeDuSA: Modulation of Medium-Duration Speech Amplitudes
MMSE: Minimum Mean Square Error
MSE: Mean Square Error
MVDR: Minimum Variance Distortionless Response (beamformer)
NMF: Non-negative Matrix Factorization
PNCC: Power-Normalized Cepstral Coefficients
RNN: Recurrent Neural Network
SE: Speech Enhancement
sMBR: state-level Minimum Bayes Risk
SNR: Signal-to-Noise Ratio
SRP-PHAT: Steered Response Power with the PHAse Transform
STFT: Short-Time Fourier Transform
TDNN: Time-Delay Neural Network
TDOA: Time Difference Of Arrival
TF: Time-Frequency
VTLN: Vocal Tract Length Normalization
VTS: Vector Taylor Series
WER: Word Error Rate
WPE: Weighted Prediction Error (dereverberation)

Notations

Basic notation
• $a$: scalar
• $\mathbf{a}$: vector
• $\mathbf{A}$: matrix
• $A$: sequence

Signal processing
• $x_j[n]$: time-domain signal at microphone $j$, sample $n$
• $X_j(t, f)$: frequency-domain coefficient of microphone $j$ at frame $t$ and frequency bin $f$
• $j \in \{1, \dots, J\}$: channel index
• $t \in \{1, \dots, T\}$: frame index
• $f \in \{1, \dots, F\}$: frequency-bin index
• $h_j[n]$: time-domain room impulse response
• $H_j(t, f)$: frequency-domain room impulse response

Notations

ASR
• $\mathbf{o}_t \in \mathbb{R}^D$: $D$-dimensional speech feature vector at frame $t$
• $O \equiv \{\mathbf{o}_t \mid t = 1, \dots, T\}$: $T$-length sequence of speech features
• $w_n \in V$: word at the $n$-th position, drawn from vocabulary $V$
• $W \equiv \{w_n \mid n = 1, \dots, N\}$: $N$-length word sequence
• $s_t \in \{1, \dots, J\}$: HMM state index at frame $t$
• $S \equiv \{s_t \mid t = 1, \dots, T\}$: $T$-length sequence of HMM states

Model
• $\theta$: model parameters

PDF
• $N(\cdot \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$: (real-valued) Gaussian distribution
• $N_c(\cdot \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$: complex Gaussian distribution

Notations

Operations
• $a^*$: complex conjugate
• $\mathbf{A}^{\mathsf T}$: transpose
• $\mathbf{A}^{\mathsf H}$: Hermitian transpose
• $\mathrm{trace}(\cdot)$: trace
• $\mathcal{P}\{\cdot\}$: principal eigenvector
• $\mathbf{a} \circ \mathbf{b}$ or $\mathbf{A} \circ \mathbf{B}$: elementwise multiplication
• $\sigma(\cdot)$: sigmoid function
• $\mathrm{softmax}(\cdot)$: softmax function
• $\tanh(\cdot)$: tanh function

1.1 Toward multi-microphone speech recognition

From pattern matching to probabilistic approaches

• 1950s–60s
  • Initial attempts with template matching
  • Recognition of digits or a few phonemes
• 1970s
  • Recognition of 1,000 words
  • First national projects (DARPA)
  • Introduction of beam search
• 1980s
  • Introduction of probabilistic model approaches (n-gram language models, GMM-HMM acoustic models)
  • First attempts with neural networks
  • Launch of initial dictation systems (Dragon Speech)

(Juang'04)

From research labs to the outside world

• 1990s
  • Discriminative training for acoustic models, MLLR adaptation, VTS
  • Development of common toolkits (HTK)
• 2000s
  • Fewer breakthrough technologies
  • New popular toolkits such as Kaldi
  • Launch of large-scale applications (Google voice search)
• 2010s
  • Introduction of DNNs and RNN language models
  • ASR used in more and more products (e.g., Siri)

(Juang'04)

Evolution of ASR performance

[Figure: word error rate (WER, log scale from 100% down to 1%) on the Switchboard task (telephone conversation speech), 1995–2015. Deep learning drives the most recent drop, reaching 6.3% WER in 2016 (by Microsoft).]

(Pallett'03, Saon'15, Xiong'16)

Impact of deep learning

• Great performance improvement
  • DNNs are more robust to input variations and bring improvements for all tasks, including Large Vocabulary Continuous Speech Recognition (LVCSR) and Distant Speech Recognition (DSR)
  • Robustness is still an issue (Seltzer'14, Delcroix'13)
  • Speech enhancement/adaptation improve performance: microphone arrays, fMLLR, …
• Reshuffling the cards
  • Some technologies relying on GMMs became obsolete: VTS, MLLR, …
  • Some technologies became less effective: VTLN, single-channel speech enhancement, …
• New opportunities
  • Exploring long-context information for recognition/enhancement
  • Front-end/back-end joint optimization, …

Towards distant ASR (DSR)

• Close-talking microphone: e.g., voice search
• Distant microphone: e.g., human-human communication, human-robot communication

Interest in DSR - Industry

• Voice-controlled appliances
• Robots
• Game consoles
• Home assistants

Interest in DSR - Academia

A lot of community-driven research projects
[Timeline figure, 2005–2016, showing projects such as the ICSI meeting corpus and RATS.]

CHiME 1, 2

• Distant speech recognition in a living room
• Acoustic conditions
  • Simulated distant speech (stereo recordings)
  • SNR: −6 dB to 9 dB
  • # mics: 2
• Speech
  • CHiME 1: commands (Grid corpus) + noise (living room)
  • CHiME 2 (WSJ): WSJ (5k) + noise (living room)

http://spandh.dcs.shef.ac.uk/chime_challenge
(Barker'13, Vincent'13)

CHiME 3, 4

• Noisy speech recognition using a tablet
• Recording conditions
  • Noise types: bus, café, street, pedestrian area
  • # mics: 6 (CHiME 3); 1, 2, 6 (CHiME 4)
  • Simulated and real recordings
• Speech
  • Read speech (WSJ, 5k)

http://spandh.dcs.shef.ac.uk/chime_challenge
(Barker'15)

REVERB

• Reverberant speech recognition
• Recording conditions
  • Reverberation (RT60: 0.2 to 0.7 s)
  • Noise type: stationary noise (SNR ~20 dB)
  • # mics: 1, 2, 8
  • Simulated and real recordings (MC-WSJ-AV)
• Speech
  • Read speech (WSJCAM0, 5k)

http://reverb2014.dereverberation.com
(Kinoshita'13, Lincoln'05)

AMI

• Meeting recognition corpus
• Recording conditions
  • Multi-speaker conversations
  • Reverberant rooms
  • # mics: 8
  • Real recordings
• Speech
  • Spontaneous meetings (8k)

http://corpus.amiproject.org/
(Carletta'05)

ASpIRE

• Large-vocabulary reverberant speech
• Recording conditions
  • Reverberant speech
  • 7 different rooms (classrooms and offices) with various shapes, sizes, surface properties, and noise sources
  • # mics: 1 or 6
• Speech
  • Training data: Fisher corpus (2,000 h of telephone speech)

https://www.iarpa.gov/index.php/working-with-iarpa/prize-challenges/306-automatic-speech-in-reverberant-environments-aspire-challenge
(Harper'15)

Summary of tasks

| Task | Vocab size | Training data | Real/Simu | Type of distortions | # mics | Mic–speaker distance | Ground truth |
|---|---|---|---|---|---|---|---|
| ASpIRE | 100K | ~2,000 h | Real | Reverberation | 8/1 | N/A | N/A |
| AMI | 11K | 75 h | Real | Multi-speaker conversations, reverberation and noise | 8 | N/A | Headset |
| CHiME 1 | 50 | 17,000 utt. | Simu | Non-stationary noise recorded in a living room (SNR −6 dB to 9 dB), reverberation from recorded impulse responses | 2 | 2 m | Clean |
| CHiME 2 (WSJ) | 5K | 7,138 utt. (~15 h) | Simu | Same as CHiME 1 | 2 | 2 m | Clean |
| CHiME 3 | 5K | 8,738 utt. (~18 h) | Simu + Real | Non-stationary noise in 4 environments | 6 | 0.5 m | Close-talk mic. |
| CHiME 4 | 5K | 8,738 utt. (~18 h) | Simu + Real | Non-stationary noise in 4 environments | 6/2/1 | 0.5 m | Close-talk mic. |
| REVERB | 5K | 7,861 utt. (~15 h) | Simu + Real | Reverberation in different living rooms (RT60 0.25–0.7 s) + stationary noise (SNR ~20 dB) | 8/2/1 | 0.5–2 m | Clean / Headset |

Content of the tutorial: Single ⇒ multi-microphone speech recognition

[Pipeline diagram: microphone signals $Y_j(t, f)$ → SE front-end → $\hat{X}(t, f)$ → feature extraction → $\mathbf{o}_t$ → recognizer (acoustic model, lexicon, language model; with model adaptation) → $W$: "My name is …"]

Multichannel speech enhancement front-end (Section 2):
• Dereverberation
• Noise reduction

* We say "multichannel" instead of "multi-microphone", following signal-processing convention.

Content of the tutorial: Multi-microphone speech recognition

[Same pipeline diagram, with the SE front-end and the recognizer optimized together]

Joint training paradigm (Section 3):
• Unify speech enhancement and speech recognition


1.2 Introduction of single-microphone speech recognition

Single-microphone speech recognition (conventional)

[Pipeline diagram: $X(t, f)$ → feature extraction → $\mathbf{o}_t$ → recognizer (acoustic model, lexicon, language model; with model adaptation) → $W$: "My name is …"]

Speech recognition

[Diagram: waveform $X$ → feature extraction → $O$ → recognizer → $W$: "That's another story"]

• $X \equiv \{x[n] \mid n = 1, \dots, N\}$: waveform; $W \equiv \{w_u \in V \mid u = 1, \dots, U\}$: word sequence
• Feature extraction (deterministic): $O = f(X)$, with $O \equiv \{\mathbf{o}_t \mid t = 1, \dots, T\}$
• Bayes decision theory (MAP): $\hat{W} = \arg\max_W p(W \mid O)$
  • $p(W \mid O)$: posterior distribution
  • Find the most probable word sequence $W$ given the feature sequence $O$

Acoustic, lexicon, and language models

[Diagram: $O$ → recognizer (acoustic model, lexicon, language model) → $W$: "That's another story"]

• Use a subword sequence $S \equiv \{s_t \mid t = 1, \dots, T\}$ and factorize $p(W \mid O)$:

$$\hat{W} = \arg\max_W p(W \mid O) = \arg\max_W \sum_S p(O \mid S)\, p(S \mid W)\, p(W)$$

• $p(O \mid S)$: acoustic model
• $p(S \mid W)$: lexicon model (obtained from a hand-crafted pronunciation dictionary)
• $p(W)$: language model
• $s_t$: subword unit composed of phonemes (/a/, /k/, …) or HMM states (/a_1/, …)

Feature extraction: Log mel filterbank extraction

• Converts a speech signal into a sequence of speech features better suited for ASR, typically log mel filterbank coefficients

[Diagram: waveform $\hat{x}[n]$ → STFT → $|\cdot|^2$ → mel filtering → $\log(\cdot)$ → log mel features $O \equiv \{\mathbf{o}_t \mid t = 1, \dots, T\}$]
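To make the diagram concrete, here is a minimal NumPy sketch of log mel feature extraction (STFT → power → mel filtering → log). The frame size, hop, and number of filters are illustrative choices, not values prescribed by the tutorial.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(x, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Waveform x[n] -> log mel filterbank features o_t (T x n_mels)."""
    # STFT: frame the signal, window, and take the squared FFT magnitude
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # T x (n_fft/2 + 1)

    # Triangular mel filterbank (emphasizes low frequencies, per the mel scale)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log compression for dynamic-range control
    return np.log(power @ fbank.T + 1e-10)

O = log_mel_features(np.random.randn(16000))  # 1 s of audio -> ~97 frames
```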

Acoustic model

• Hidden Markov model (per phoneme)
• Factorize the probability distribution into a product over $t$ using conditional independence assumptions (first-order Markov assumption and $p(O \mid s_t) \approx p(\mathbf{o}_t \mid s_t)$):

$$p(O \mid S) \approx \prod_{t=1}^{T} p(\mathbf{o}_t \mid s_t)\, p(s_t \mid s_{t-1})$$

• $p(\mathbf{o}_t \mid s_t)$: likelihood function
• $p(s_t \mid s_{t-1})$: state transition probability

[Diagram: HMM state chain $s_{t-1} \to s_t \to s_{t+1}$ with transitions $p(s_t \mid s_{t-1})$ and emissions $p(\mathbf{o}_t \mid s_t)$ generating $\mathbf{o}_{t-1}, \mathbf{o}_t, \mathbf{o}_{t+1}$]

• The likelihood function $p(\mathbf{o}_t \mid s_t = k)$ is computed by
  • a Gaussian mixture model:

$$p(\mathbf{o}_t \mid s_t = k) = \sum_{l=1}^{L} \omega_{kl}\, N(\mathbf{o}_t \mid \boldsymbol{\mu}_{kl}, \boldsymbol{\Sigma}_{kl})$$

  • $N(\mathbf{o} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$: Gaussian distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$; $\omega_{kl}$: mixture weight
  • Parameters are estimated with the Expectation-Maximization (EM) algorithm
  • or a deep neural network (discussed later)
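For illustration, a minimal NumPy sketch of the GMM likelihood $p(\mathbf{o}_t \mid s_t = k)$ above, assuming diagonal covariances and using log-sum-exp for numerical stability; all shapes and names are this sketch's assumptions.

```python
import numpy as np

def gmm_log_likelihood(o_t, weights, means, variances):
    """log p(o_t | s_t = k) for one HMM state's diagonal-covariance GMM.

    weights: (L,); means/variances: (L, D) for L mixture components.
    """
    D = o_t.shape[0]
    # Per-component log N(o_t | mu_l, Sigma_l), Sigma_l diagonal
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((o_t - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp          # (L,)
    # log-sum-exp over the mixture components
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Toy usage: 2-component GMM over 3-dimensional features
ll = gmm_log_likelihood(np.zeros(3), np.array([0.4, 0.6]),
                        np.zeros((2, 3)), np.ones((2, 3)))
```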

Expectation-Maximization (EM) algorithm

• Maximum likelihood criterion:

$$\hat{\Theta} = \arg\max_\Theta p(O; \Theta) = \arg\max_\Theta \sum_S p(O, S; \Theta)$$

• Hard to estimate directly due to the summation over all possible $S$
• Auxiliary function:

$$Q(\Theta, \hat{\Theta}) = \sum_S p(S \mid O; \hat{\Theta}) \log p(O, S; \Theta), \qquad \hat{\Theta} \leftarrow \arg\max_\Theta Q(\Theta, \hat{\Theta})$$

• $p(S \mid O; \hat{\Theta})$: posterior distribution of the latent variable $S$, obtained with the previously estimated parameters $\hat{\Theta}$ — with it, we can avoid the summation over all possible $S$
• The update rule is iterated, $\hat{\Theta}_{\tau=1} \to \cdots \to \hat{\Theta}_{\tau} \to \hat{\Theta}_{\tau+1}$, and always satisfies $p(O; \hat{\Theta}_{\tau}) \le p(O; \hat{\Theta}_{\tau+1})$ ⇒ obtains a locally optimal parameter
• Viterbi approximation: $p(S \mid O; \hat{\Theta}) \approx \delta(S - \hat{S})$, where $\hat{S} = \arg\max_S p(S \mid O; \hat{\Theta})$
• Examples
  1. GMM: $Q(\Theta, \hat{\Theta}) = \sum_{l,t} \gamma_{l,t} \log \omega_l N(\mathbf{o}_t \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)$, where $\gamma_{l,t} = p(v_t = l \mid \mathbf{o}_t; \hat{\Theta})$
  2. HMM: $Q(\Theta, \hat{\Theta}) = \sum_{k,l,t} \gamma_{k,l,t} \log \omega_{kl} N(\mathbf{o}_t \mid \boldsymbol{\mu}_{kl}, \boldsymbol{\Sigma}_{kl}) + \sum_{k,k',t} \gamma_{k,k',t} \log a_{k,k'}$, where $\gamma_{k,l,t} = p(s_t = k, v_t = l \mid O; \hat{\Theta})$ and $\gamma_{k,k',t} = p(s_{t-1} = k, s_t = k' \mid O; \hat{\Theta})$
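As a worked example of the E-step/M-step recipe above, here is one EM iteration for a plain diagonal-covariance GMM (the simpler of the two examples; the HMM case adds transition statistics). A sketch under illustrative shapes, not a full trainer.

```python
import numpy as np

def em_step_gmm(O, weights, means, variances):
    """One EM iteration for a diagonal-covariance GMM (O: T x D).

    E-step: gamma[t, l] = p(v_t = l | o_t; Theta) under current parameters.
    M-step: closed-form re-estimation maximizing Q(Theta, Theta_hat).
    """
    T, D = O.shape
    # E-step: component posteriors (responsibilities)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(variances), axis=1)
                - 0.5 * D * np.log(2 * np.pi)
                - 0.5 * np.sum((O[:, None, :] - means) ** 2 / variances, axis=2))
    log_comp -= log_comp.max(axis=1, keepdims=True)
    gamma = np.exp(log_comp)
    gamma /= gamma.sum(axis=1, keepdims=True)          # T x L

    # M-step: updates from the expected sufficient statistics
    Nl = gamma.sum(axis=0)                             # soft counts per component
    new_weights = Nl / T
    new_means = (gamma.T @ O) / Nl[:, None]
    new_vars = (gamma.T @ (O ** 2)) / Nl[:, None] - new_means ** 2
    return new_weights, new_means, np.maximum(new_vars, 1e-6)

# Each call never decreases the data likelihood p(O; Theta)
w, mu, var = em_step_gmm(np.random.randn(100, 3), np.array([0.5, 0.5]),
                         np.random.randn(2, 3), np.ones((2, 3)))
```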

Deep neural network for acoustic modeling

• Hidden Markov model:

$$p(O \mid S) \approx \prod_{t=1}^{T} p(\mathbf{o}_t \mid s_t)\, p(s_t \mid s_{t-1})$$

• The likelihood function is computed with the pseudo-likelihood trick:

$$p(\mathbf{o}_t \mid s_t) = \frac{p(s_t \mid \mathbf{o}_t)\, p(\mathbf{o}_t)}{p(s_t)} \propto \frac{p(s_t \mid \mathbf{o}_t)}{p(s_t)}$$

• $p(s_t \mid \mathbf{o}_t)$: posterior distribution of HMM state $s_t$ given feature $\mathbf{o}_t$; $p(s_t)$: prior distribution of $s_t$
• We can easily build a frame-level classifier for $p(s_t \mid \mathbf{o}_t)$ once we have parallel data $\{(\mathbf{o}_t, s_t)\}_t$, e.g., a log-linear model or a deep neural network (DNN)
• DNN: a very powerful classifier that captures the correlations and non-linearity of the input feature vectors (discussed later)
• With the pseudo-likelihood trick, we can combine a frame-level DNN classifier with the hidden Markov model to obtain the sequence likelihood function $p(O \mid S)$
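The pseudo-likelihood trick itself is one line of code. A hedged sketch, assuming the DNN emits log posteriors and the state priors are relative frequencies counted from the training alignments:

```python
import numpy as np

def dnn_scaled_loglik(log_posteriors, state_priors):
    """Pseudo-likelihood trick: log p(o_t|s_t) ~ log p(s_t|o_t) - log p(s_t).

    log_posteriors: T x K frame-level DNN outputs log p(s_t = k | o_t);
    state_priors:   K-dim p(s_t), typically the relative state frequencies
                    counted from the training alignments.
    """
    return log_posteriors - np.log(state_priors)  # used as HMM emission scores
```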

Basics of deep neural networks

[Diagram: input layer ($l = 0$) → hidden layers ($l$) → output layer ($l = L$); sigmoid and ReLU activation-function curves]

• Layer computation:

$$\mathbf{a}_t^{l} = \mathbf{W}^{l} \mathbf{h}_t^{l-1} + \mathbf{b}^{l}, \qquad \mathbf{h}_t^{l} = \sigma(\mathbf{a}_t^{l})$$

• Output layer: $p(s_t = k \mid \mathbf{o}_t) = h_{t,k}^{L} = [\mathrm{softmax}(\mathbf{a}_t^{L})]_k$
• Trained using error back-propagation
• Training criteria: cross entropy, MMSE, state-level MBR, …

DNN-based acoustic model (Hinton'12, Mohamed'12)

• Minimize the cross entropy:

$$J(\theta) = -\sum_t \sum_k \tau_{t,k} \log h_{t,k}^{L}(\theta)$$

• $\tau_{t,k}$: target label
• $h_{t,k}^{L}$: network output
• $\theta = \{\mathbf{W}^{l}, \mathbf{b}^{l}\}_{l=1}^{L}$: network parameters
• Use a large amount of speech training data with the associated HMM state alignments
• Optimization using gradient descent: $\hat{\theta} = \arg\min_\theta J(\theta)$, updated as

$$\theta \leftarrow \theta - \eta \frac{\partial J(\theta)}{\partial \theta}$$

• $\eta$: learning rate
• The gradient is computed by error back-propagation

[Diagram: input log mel filterbank features with 11 context frames → ~7 hidden layers of 2,048 units → 1,000–10,000 output units over HMM states (a, i, u, …, w, N)]

DNN training with error back-propagation

[Diagram: input speech features → DNN → output $\mathbf{h}^{(L)} = (0.1, 0.8, 0.1, 0, 0)$ vs. one-hot label $\boldsymbol{\tau} = (0, 1, 0, 0, 0)$ over HMM states a, i, u, w, N]

• Forward: propagate the features through the network to obtain $\mathbf{h}^{(L)}$
• Compute the error signal at the softmax output: $\boldsymbol{\delta}^{(L)} = \mathbf{h}^{(L)} - \boldsymbol{\tau}$
• Back-propagate the error through each sigmoid layer:

$$\boldsymbol{\delta}^{(l-1)} = \big( (\mathbf{W}^{(l)})^{\mathsf T} \boldsymbol{\delta}^{(l)} \big) \circ \mathbf{h}^{(l-1)} \circ \big( \mathbf{1} - \mathbf{h}^{(l-1)} \big)$$

• Compute the gradients:

$$\frac{\partial J(\theta)}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \big( \mathbf{h}^{(l-1)} \big)^{\mathsf T}, \qquad \frac{\partial J(\theta)}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$

• Update the parameters:

$$\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \frac{\partial J(\theta)}{\partial \mathbf{W}^{(l)}}, \qquad \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \eta \frac{\partial J(\theta)}{\partial \mathbf{b}^{(l)}}$$

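Putting the forward pass, error signal, back-propagation, and update rule above together, here is a minimal NumPy sketch of one cross-entropy training step for a sigmoid MLP with a softmax output; layer sizes and the learning rate are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def train_step(x, tau, Ws, bs, eta=0.1):
    """One CE/backprop step for a sigmoid MLP with softmax output.

    x: input feature vector; tau: one-hot HMM-state label;
    Ws/bs: lists of weight matrices and bias vectors (layers 1..L).
    """
    # Forward: propagate the feature through the network
    hs = [x]
    for W, b in zip(Ws[:-1], bs[:-1]):
        hs.append(sigmoid(W @ hs[-1] + b))
    h_L = softmax(Ws[-1] @ hs[-1] + bs[-1])

    # Output error signal: delta^(L) = h^(L) - tau
    # (gradient of cross entropy w.r.t. the softmax pre-activation)
    delta = h_L - tau
    for l in reversed(range(len(Ws))):
        gW, gb = np.outer(delta, hs[l]), delta        # dJ/dW^(l), dJ/db^(l)
        if l > 0:  # back-propagate through the sigmoid: h o (1 - h)
            delta = (Ws[l].T @ delta) * hs[l] * (1.0 - hs[l])
        Ws[l] -= eta * gW                             # gradient descent update
        bs[l] -= eta * gb

# Toy usage: 5-dim input, one hidden layer of 8 units, 3 HMM states
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 5)), rng.normal(size=(3, 8))]
bs = [np.zeros(8), np.zeros(3)]
train_step(rng.normal(size=5), np.eye(3)[1], Ws, bs)
```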

Language model

• $N$-gram language model:

$$p(W = w_1, \dots, w_U) = \prod_{u=1}^{U} p(w_u \mid w_{1:u-1}) \approx \prod_{u=1}^{U} p(w_u \mid w_{u-N+1:u-1})$$

• $(N-1)$-th order Markov assumption
• Often encounters the zero-occupancy problem ⇒ Kneser-Ney or other smoothing techniques
• Easily combined with a decoder

• Recurrent neural network language model (Mikolov'10):

$$p(W) = \prod_{u=1}^{U} p(w_u \mid w_{1:u-1}) = \prod_{u=1}^{U} \big[ \mathrm{softmax}\big( \mathbf{a}_u^{L}(w_{1:u-1}) \big) \big]_{w_u}$$

• No Markov assumption
• High performance
• RNN or LSTM
• Hard to integrate into the decoder; N-best rescoring is used instead
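For illustration, a toy count-based LM sketch. It uses a bigram with simple add-$k$ smoothing as a stand-in for the Kneser-Ney smoothing one would actually use; the zero-occupancy problem is visible in the unseen-bigram case.

```python
from collections import Counter

def train_bigram(corpus, k=0.1):
    """Bigram LM with add-k smoothing (a stand-in for Kneser-Ney here).

    p(w_u | w_{u-1}) = (c(w_{u-1}, w_u) + k) / (c(w_{u-1}) + k * |V|)
    """
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        words = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])                 # history counts
        bigrams.update(zip(words[:-1], words[1:]))  # bigram counts
    V = len(vocab)
    return lambda w_prev, w: (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

p = train_bigram(["my name is shinji", "my name was here"])
print(p("name", "is"))   # seen bigram: relatively high probability
print(p("name", "bus"))  # unseen bigram: small but non-zero thanks to smoothing
```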

Acoustic, lexicon, and language models

[Diagram: $O$ → recognizer (acoustic model, lexicon, language model) → $W$: "That's another story"]

• Use a subword sequence $S \equiv \{s_t \mid t = 1, \dots, T\}$ and factorize $p(W \mid O)$:

$$\hat{W} = \arg\max_W p(W \mid O) = \arg\max_W \sum_S p(O \mid S)\, p(S \mid W)\, p(W)$$

• $p(O \mid S)$: GMM-HMM or hybrid DNN-HMM acoustic model
• $p(S \mid W)$: lexicon model
• $p(W)$: $N$-gram or recurrent neural network language model
• Search for the most probable word sequence $\hat{W}$ with a decoder

1.3 Review of state-of-the-art single-channel ASR

Challenges of DSR

[Illustration: a distant microphone captures the target speech together with reverberation, background noise, and an interfering speaker.]

Recent achievements

• REVERB 2014 (WER): baseline (GMM) 48.9% → robust back-end 22.2% → multi-mic front-end + robust back-end 9.0%
• CHiME-4 2016 (WER): baseline (DNN) 11.5% → robust back-end 7.5% → multi-mic front-end + robust back-end 3.0%

Robust back-end (single-channel speech recognition)

[Pipeline diagram: $Y_j(t, f)$ → SE front-end → $\hat{X}(t, f)$ → feature extraction → $\mathbf{o}_t$ → recognizer (acoustic model, lexicon, language model; model adaptation) → $W$]

Covered here:
• Feature extraction
• Acoustic model
• Model adaptation

1.3.1 Feature extraction

[Pipeline diagram with the feature extraction stage highlighted]

Feature extraction

• Log mel filterbank
  • Spectral analysis
  • Power extraction (phase is disregarded)
  • Emphasis of low-frequency power following perceptual knowledge (mel scale)
  • Dynamic-range control
  • Cepstral mean and variance normalization (CMVN)

[Diagram: $y[n]$ → STFT → $|\cdot|^2$ → mel filtering → $\log(\cdot)$ → $\mathbf{o}_t$]

• ETSI advanced front-end (ETSI'07)
  • Developed in the Aurora project
  • Time-domain Wiener-filtering (WF) based noise reduction

[Diagram: $y[n]$ → STFT → PSD estimation (with VAD) → Wiener filter design → mel filtering → mel IDCT → apply filter → $\hat{y}[n]$ → feature extraction → CMVN → $\mathbf{o}_t$]
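CMVN, the last box in the pipeline above, is simple enough to show directly; a per-utterance sketch (in practice it can also be estimated per speaker or per session):

```python
import numpy as np

def cmvn(O):
    """Cepstral/feature mean and variance normalization, per utterance.

    O: T x D feature matrix (e.g., log mel features). Normalizing each
    dimension to zero mean and unit variance removes stationary channel
    effects and controls the dynamic range.
    """
    mean = O.mean(axis=0)
    std = O.std(axis=0) + 1e-10   # guard against zero variance
    return (O - mean) / std

O_norm = cmvn(np.random.randn(300, 40) * 3.0 + 5.0)  # ~zero mean, unit variance
```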

Gammatone-filtering-based features

• Filters motivated by the human auditory system
• Power-Normalized Cepstral Coefficients (PNCC) (Kim'12)
  • Replace $\log(\cdot)$ with the power function $(\cdot)^{0.1}$, and use frequency-domain gammatone filtering and medium-duration power bias subtraction

[Diagram: STFT → $|\cdot|^2$ → gammatone filter (freq.) → power bias subtraction → $(\cdot)^{0.1}$]

• Time-domain gammatone filtering (e.g., Schluter'07, Mitra'14)
  • Can be combined with amplitude-modulation-based features
• Gammatone filtering and amplitude-modulation-based features (Damped Oscillator Coefficients (DOC), Modulation of Medium Duration Speech Amplitudes (MMeDuSA)) showed significant improvements on the CHiME 3 task:

| WER (%) | MFCC | DOC | MMeDuSA |
|---|---|---|---|
| CHiME 3 real eval (MVDR-enhanced signal) | 8.83 | 5.91 | 6.62 |

(Hori'15)

(Linear) feature transformation

• Linear Discriminant Analysis (LDA)
  • Concatenate contiguous features: $\mathbf{x}_t = [\mathbf{o}_{t-L}^{\mathsf T}, \dots, \mathbf{o}_t^{\mathsf T}, \dots, \mathbf{o}_{t+L}^{\mathsf T}]^{\mathsf T}$
  • $\hat{\mathbf{o}}_t^{\mathrm{LDA}} = \mathbf{A}^{\mathrm{LDA}} \mathbf{x}_t$
  • Estimate a dimension-reducing transformation by discriminant analysis → captures long-term dependencies
• Semi-Tied Covariance (STC) / Maximum Likelihood Linear Transformation (MLLT)
  • Replace $N(\mathbf{o}_t \mid \boldsymbol{\mu}_{kl}, \boldsymbol{\Sigma}_{kl}^{\mathrm{diag}})$ by $N(\mathbf{o}_t \mid \boldsymbol{\mu}_{kl}, \boldsymbol{\Sigma}_{kl}^{\mathrm{full}})$ with the relationship

$$\boldsymbol{\Sigma}_{kl}^{\mathrm{full}} = \mathbf{A}^{\mathrm{STC}} \boldsymbol{\Sigma}_{kl}^{\mathrm{diag}} \big( \mathbf{A}^{\mathrm{STC}} \big)^{\mathsf T}$$

  • Estimate $\mathbf{A}^{\mathrm{STC}}$ by maximum likelihood
  • During recognition, we can evaluate the likelihood with a diagonal covariance and a feature transformation:

$$N\big( \hat{\mathbf{o}}_t^{\mathrm{STC}} \,\big|\, \mathbf{A}^{\mathrm{STC}} \boldsymbol{\mu}_{kl}, \boldsymbol{\Sigma}_{kl}^{\mathrm{diag}} \big), \qquad \text{where } \hat{\mathbf{o}}_t^{\mathrm{STC}} = \mathbf{A}^{\mathrm{STC}} \mathbf{o}_t$$

(Linear) feature transformation, cont'd

• Feature-space Maximum Likelihood Linear Regression (fMLLR)
  • Affine transformation: $\hat{\mathbf{o}}_t = \mathbf{A}^{\mathrm{fM}} \mathbf{o}_t + \mathbf{b}^{\mathrm{fM}}$
  • Estimate $\mathbf{A}^{\mathrm{fM}}$ and $\mathbf{b}^{\mathrm{fM}}$ by maximum likelihood:

$$Q(\mathbf{A}^{\mathrm{fM}}, \mathbf{b}^{\mathrm{fM}}) = \sum_{k,t,l} \gamma_{t,k,l} \big( \log |\mathbf{A}^{\mathrm{fM}}| + \log N(\mathbf{A}^{\mathrm{fM}} \mathbf{o}_t + \mathbf{b}^{\mathrm{fM}} \mid \boldsymbol{\mu}_{kl}, \boldsymbol{\Sigma}_{kl}) \big)$$

• LDA, STC, and fMLLR are combined in cascade:

$$\hat{\mathbf{o}}_t = \mathbf{A}^{\mathrm{fM}} \big( \mathbf{A}^{\mathrm{STC}} \big( \mathbf{A}^{\mathrm{LDA}} [\mathbf{o}_{t-L}^{\mathsf T}, \dots, \mathbf{o}_t^{\mathsf T}, \dots, \mathbf{o}_{t+L}^{\mathsf T}]^{\mathsf T} \big) \big) + \mathbf{b}^{\mathrm{fM}}$$

• Effect of feature transformation in distant-ASR scenarios (GMM)
  • LDA, STC, and fMLLR used in cascade yield significant improvements
  • All are based on GMM-HMMs, but still applicable to DNNs as feature extraction
  • MFCC is more appropriate than filterbank features here, as MFCC matches the GMM assumptions

| WER (%) | MFCC, Δ, ΔΔ | LDA, STC, fMLLR |
|---|---|---|
| CHiME 2 | 44.04 | 33.71 |
| REVERB | 39.56 | 30.88 |

(Tachioka'13, '14)

1.3.2 Robust acoustic models

[Pipeline diagram with the acoustic model highlighted]

DNN acoustic model

• Non-linear transformation of (long) context features obtained by concatenating contiguous frames: $\mathbf{o}_t = [\mathbf{x}_{t-L}, \dots, \mathbf{x}_t, \dots, \mathbf{x}_{t+L}]$, usually 11 frames
  → very powerful for noise-robust ASR
• Cross-entropy criterion:

$$J^{\mathrm{CE}}(\theta) = -\sum_t \sum_k \tau_{t,k} \log h_{t,k}^{L}(\theta)$$

• There are several other training criteria

Sequence discriminative criterion

• Sequence discriminative criterion:

$$J^{\mathrm{seq}}(\theta) = \sum_u \sum_W E(W, W_u)\, p(W \mid O_u)$$

• $E(W, W_u)$: sequence-level error between the reference $W_u$ and the hypothesis $W$, e.g., reference "My name was …" vs. LVCSR-decoder hypothesis "My name is …"
• State-level Minimum Bayes Risk (sMBR)

| WER (%) | GMM | DNN CE | DNN sMBR |
|---|---|---|---|
| CHiME 3 baseline v2 | 23.06 | 17.89 | 15.88 |

Multi-task objectives

• Use both the MMSE and CE criteria
  • $X$ as the clean-speech target
  • $T$ as the transcription
• The network tries to solve both enhancement and recognition (outputs $h_{t,k}^{\mathrm{MSE},L}$ and $h_{t,k}^{\mathrm{CE},L}$)
• $\rho$ controls the balance between the two criteria
• Will be used to regularize a particular layer during joint training in Section 3

| WER (%) | CE | Multi-task ($\rho = 0.91$) |
|---|---|---|
| REVERB RealData | 32.12 | 31.97 |

(Giri'15)
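A minimal sketch of the combined objective. The convention $J = \rho J^{\mathrm{CE}} + (1 - \rho) J^{\mathrm{MSE}}$ is an assumption of this sketch, not necessarily the exact weighting used in (Giri'15).

```python
import numpy as np

def multitask_loss(h_ce, tau, h_mse, x_clean, rho=0.91):
    """Multi-task objective mixing recognition (CE) and enhancement (MSE).

    h_ce:  softmax outputs for HMM states; tau: one-hot state label;
    h_mse: enhancement-branch output;      x_clean: clean-speech target;
    rho balances the two criteria (0.91 matches the slide's table; the
    weighting convention itself is this sketch's assumption).
    """
    j_ce = -np.sum(tau * np.log(h_ce + 1e-10))
    j_mse = np.mean((h_mse - x_clean) ** 2)
    return rho * j_ce + (1.0 - rho) * j_mse
```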

Toward further long context

• Time-Delay Neural Network (TDNN)
• Convolutional Neural Network (CNN)
• Recurrent Neural Network (RNN)
  • Long Short-Term Memory (LSTM)

Time-delay neural network (TDNN)

• Deals with "very" long context (e.g., 17 frames: $\mathbf{x}_{t-8}, \dots, \mathbf{x}_t, \dots, \mathbf{x}_{t+8}$ feeding $\mathbf{h}_t^{5}$)
• A plain DNN over such a window is hard to train: the first-layer matrix suffers from vanishing gradients

(Waibel'89, Peddinti'15)

Time-delay neural network (TDNN)

• Original TDNN: consider a short context per layer (e.g., [−2, 2]), but expand the context at each layer:

$$\mathbf{h}_t^{1} = \sigma\big( \mathbf{A}^{1} [\mathbf{x}_{t-2}, \mathbf{x}_{t-1}, \mathbf{x}_t, \mathbf{x}_{t+1}, \mathbf{x}_{t+2}] + \mathbf{b}^{1} \big)$$
$$\mathbf{h}_t^{2} = \sigma\big( \mathbf{A}^{2} [\mathbf{h}_{t-2}^{1}, \mathbf{h}_{t-1}^{1}, \mathbf{h}_t^{1}, \mathbf{h}_{t+1}^{1}, \mathbf{h}_{t+2}^{1}] + \mathbf{b}^{2} \big), \qquad \mathbf{h}_t^{3} = \cdots$$

• Very large computational cost

(Waibel'89, Peddinti'15)

Time-delay neural network (TDNN)

• Subsampled TDNN (Peddinti'15): subsample the frames used in the context expansion, e.g.

$$\mathbf{h}_t^{2} = \sigma\big( \mathbf{A}^{2} [\mathbf{h}_{t-2}^{1}, \mathbf{h}_{t+2}^{1}] + \mathbf{b}^{2} \big)$$

• Efficiently computes a long-context network

| WER (%) | DNN | TDNN |
|---|---|---|
| ASpIRE | 33.1 | 30.8 |
| AMI | 53.4 | 50.7 |

(Waibel'89, Peddinti'15)
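A NumPy sketch of a TDNN layer as frame splicing at given time offsets; passing [-2, 2] instead of [-2, -1, 0, 1, 2] gives the subsampled variant. Clipping at the sequence edges is an illustrative choice.

```python
import numpy as np

def tdnn_layer(H, offsets, W, b):
    """One (subsampled) TDNN layer: splice hidden vectors at the given
    time offsets, then apply an affine transform and a ReLU.

    H: T x D input sequence; offsets like [-2, -1, 0, 1, 2] (original TDNN)
    or [-2, 2] (subsampled variant); W: D_out x (len(offsets) * D).
    """
    T, D = H.shape
    idx = np.clip(np.arange(T)[:, None] + np.array(offsets), 0, T - 1)
    spliced = H[idx].reshape(T, -1)              # T x (len(offsets) * D)
    return np.maximum(spliced @ W.T + b, 0.0)    # ReLU

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))
# Full splicing at the first layer, subsampled splicing above it:
H1 = tdnn_layer(X, [-2, -1, 0, 1, 2], rng.normal(size=(64, 200)), np.zeros(64))
H2 = tdnn_layer(H1, [-2, 2], rng.normal(size=(64, 128)), np.zeros(64))
# The effective input context grows with depth even though each layer is cheap.
```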

Convolutional Neural Network (CNN)

• Represents the input as a time-frequency feature map $o_{t,p,q}$ (multiple maps can also be used, e.g., one each for static, delta, and delta-delta features), where $p$ and $q$ index the time and frequency axes of the feature maps
• Convolution, activation, and max pooling:

$$a_{t,p,q}^{(m)} = w_{p,q}^{(m)} \circ o_{t,p,q} + b^{(m)}, \qquad h_{t,p,q}^{(m)} = \sigma\big( a_{t,p,q}^{(m)} \big), \qquad h_{t,q}^{(m)} = \max_p h_{t,p,q}^{(m)} \ \text{(max pooling)}$$

• Time-dimension feature maps can capture long-context information
• REVERB: 23.5 (DNN) → 22.4 (CNN-DNN) (Yoshioka'15a)

RNN/LSTM acoustic model

• An RNN can also capture the long-term distortion effects caused by reverberation and noise
• RNNs/LSTMs can be applied as acoustic models for noise-robust ASR (Weng'14, Weninger'14)

[Diagram: noisy speech features $\mathbf{o}_{t-1}, \mathbf{o}_t$ → RNN or LSTM → output HMM states (a, i, u, …, w, N)]

| WER (%) | DNN | RNN |
|---|---|---|
| Aurora 4 | 13.33 | 12.74 |
| CHiME 2 | 29.89 | 27.70 |


Practical issues

The importance of the alignments

• DNN CE training needs frame-level labels $\tau_{t,k}$ obtained with the Viterbi algorithm:

$$J^{\mathrm{CE}}(\theta) = -\sum_t \sum_k \tau_{t,k} \log h_{t,k}^{L}$$

• However, it is very difficult to obtain precise labels $\tau_{t,k}$ for noisy speech (e.g., silence is often misaligned)
• How to deal with this issue?
  • Re-align after retraining the DNN several times
  • Sequence discriminative training can mitigate the issue (but since the initial model is CE-trained, it is difficult to fully recover the degradation)
  • Use alignments from parallel clean data, if available:

| WER (%) | Noisy alignment | Clean alignment |
|---|---|---|
| CHiME 2 | 29.89 | 24.75 |

(Weng'14)

Degradation due to enhanced features

• Which features should we use for training acoustic models?
  • Noisy features: $\mathbf{o}_t^{\mathrm{noisy}} = \mathrm{FE}(Y)$
  • Enhanced features: $\mathbf{o}_t^{\mathrm{enh}} = \mathrm{FE}(\hat{X})$

[Diagram: $Y_j(t, f)$ → SE front-end → $\hat{X}(t, f)$ → feature extraction → $\mathbf{o}_t^{\mathrm{enh}}$; versus $Y_j(t, f)$ → feature extraction → $\mathbf{o}_t^{\mathrm{noisy}}$]

CHiME 3 real eval:

| Training | Testing | WER (%) |
|---|---|---|
| Noisy $\mathbf{o}_t^{\mathrm{noisy}}$ | Noisy $\mathbf{o}_t^{\mathrm{noisy}}$ | 23.66 |
| Noisy $\mathbf{o}_t^{\mathrm{noisy}}$ | Enhanced $\mathbf{o}_t^{\mathrm{enh}}$ | 14.86 |
| Enhanced $\mathbf{o}_t^{\mathrm{enh}}$ | Enhanced $\mathbf{o}_t^{\mathrm{enh}}$ | 16.17 |

• Retraining with enhanced features degrades the ASR performance!
• Noisy-data training appears more robust to distorted speech (?)

Remarks

• Noise-robust features and linear feature transformations are effective, for both GMM and DNN acoustic modeling
• Deep learning is effective for noise-robust ASR
  • DNNs with sequence discriminative training remain powerful
  • RNNs, TDNNs, and CNNs capture the long-term dependencies of speech and are more effective against reverberation and complex noise
• We can basically use standard acoustic modeling techniques even in distant-ASR scenarios
• However, special care is needed for
  • Alignments
  • Retraining with enhanced features

1.3.3 Acoustic model adaptation

[Pipeline diagram with the model adaptation stage highlighted]

Importance of acoustic model adaptation

[Bar chart: WER (%) on a 0–10 scale, with and without adaptation, for REVERB (Delcroix'15a) and CHiME 3 (Yoshioka'15b); adaptation reduces WER on both tasks.]

Acoustic model adaptation

• DNNs are very powerful, so why do we need adaptation?
  • Unseen test conditions due to the limited amount of training data
  • A model trained on a large amount of data may be good on average but not optimal for a specific condition

[Diagram: acoustic model trained on data from various speakers & environments → adapted to a specific speaker & environment at test time]

Supervised/unsupervised adaptation

• Supervised adaptation
  • We know what was spoken: transcriptions are associated with the adaptation data
• Unsupervised adaptation
  • We do not know what was spoken: there are no transcriptions


DNN adaptation techniques

• Model adaptation
  • Retraining
  • Linear transformation of input or hidden layers (fDLR, LIN, LHN, LHUC)
  • Adaptive training (cluster/speaker adaptive training)
• Auxiliary features
  • Auxiliary input features
  • Noise-aware training
  • Speaker-aware training
  • Context-adaptive DNN


Unsupervised label estimation

• 1st pass
  • Decode the adaptation data with an existing ASR system
  • Obtain estimated labels $\hat{\tau}_{t,k}$

[Diagram: adaptation speech data $y_j[n]$ → speech enhancement → $\hat{x}[n]$ → feature extraction → $\mathbf{o}_t$ → recognizer (acoustic model, lexicon, language model) → estimated labels $\hat{\tau}_{t,k}$]

Retraining

• Retrain/adapt the acoustic model parameters given the estimated labels, using error back-propagation (Liao'13)
• Prevent the model from changing too much:
  • Small learning rate
  • Small number of epochs (early stopping)
  • Regularization (e.g., L2 prior norm (Liao'13), KL divergence (Yu'13))
• With large amounts of adaptation data, retrain all or part of the DNN (e.g., the lower layers)

[Diagram: DNN output $h_{t,k}^{L} = (0.1, 0.8, 0.1, 0, 0)$ vs. estimated label $\hat{\tau}_{t,k} = (0, 1, 0, 0, 0)$ over HMM states a, i, u, w, N]

Linear input network (LIN)

• Add a linear layer that transforms the input features: $\hat{\mathbf{o}}_t = \mathbf{A} \mathbf{o}_t + \mathbf{b}$ (no activation)
• Learn the transform with error back-propagation

(Neto'95)

Linear hidden network (LHN)

• Insert a linear transformation layer inside the network: $\hat{\mathbf{h}}_t^{l} = \mathbf{A} \mathbf{h}_t^{l} + \mathbf{b}$ (no activation)

(Gemello'06)

Learning hidden unit contributions (LHUC)

• Similar to LHN, but with a diagonal matrix → fewer parameters:

$$\hat{\mathbf{h}}_t^{l} = \mathbf{A} \mathbf{h}_t^{l}, \qquad \mathbf{A} = \begin{pmatrix} a_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & a_N \end{pmatrix}$$

(Swietojanski'14b)
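A sketch of the LHUC rescaling. Re-parameterizing each diagonal entry as $a = 2\sigma(r)$, so that each amplitude stays in $(0, 2)$ and only the $r$ vector is learned per speaker/condition, follows (Swietojanski'14b); the sizes below are illustrative.

```python
import numpy as np

def lhuc(h, r):
    """Learning Hidden Unit Contributions: rescale each hidden unit.

    Equivalent to the slide's diagonal matrix A: h_hat = diag(a) h, with
    a = 2 * sigmoid(r). The r vector is the only adapted parameter set.
    """
    a = 2.0 / (1.0 + np.exp(-r))   # amplitudes in (0, 2)
    return a * h

h_adapted = lhuc(np.random.randn(2048), np.zeros(2048))  # r = 0 -> identity
```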

Room adaptation for REVERB (RealData)

• Speech processed with WPE (1ch); ~9 min of adaptation data
• Back-end: DNN with 7 hidden layers, trigram LM

| Adaptation | WER (%) |
|---|---|
| – | 24.1 |
| 1st | 21.7 |
| All | 22.1 |
| LIN | 22.1 |

Results from (Delcroix'15a)

Model adaptation

+ Can adapt to conditions unseen during training
− Computationally expensive, with processing delay: requires 2 decoding passes
− Data demanding: a relatively large amount of adaptation data is needed


Auxiliary-features-based adaptation

• Exploit auxiliary information about the speaker or the noise
• Simple approach:
  • Concatenate auxiliary features to the input features
  • The weights for the auxiliary features are learned during training
• Auxiliary features represent, e.g.:
  • Speaker awareness (i-vectors, bottleneck features) (Saon'13)
  • Noise awareness (noise estimates) (Seltzer'13)
  • Room awareness (RT60, distance, …) (Giri'15)
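The simple concatenation scheme above in NumPy: tile the utterance-level auxiliary vector across frames and append it to each input feature vector; dimensions are illustrative.

```python
import numpy as np

def augment_with_aux(o, aux):
    """Auxiliary-feature adaptation: concatenate an utterance-level
    auxiliary vector (e.g., an i-vector or a noise estimate) to every
    frame; the DNN input layer then learns how to exploit it.
    """
    T = o.shape[0]
    return np.hstack([o, np.tile(aux, (T, 1))])

frames = np.random.randn(300, 440)   # e.g., 40-dim log mel x 11 context frames
ivec = np.random.randn(100)          # e.g., a 100-dim speaker i-vector
net_input = augment_with_aux(frames, ivec)   # 300 x 540
```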

Speaker adaptation

• Speaker i-vectors or bottleneck features have been shown to improve performance on many tasks
• Other features, such as noise or room parameters, have also been shown to improve performance

| Auxiliary feature | Aurora 4 | REVERB |
|---|---|---|
| – | 9.6% | 20.1% |
| i-vector | 9.0% | 18.2% |
| Speaker-ID bottleneck | 9.3% | 17.4% |

Results from (Kundu'15)

Auxiliary-features-based adaptation

+ Rapid adaptation: auxiliary features can be computed per utterance (~10 s or less)
+ Computationally friendly: no extra decoding pass needed (single-pass unsupervised adaptation)
− Does not extend to unseen conditions: requires training data covering all test cases

Summary of single-channel distant speech recognition

• Many effective techniques were developed for general speech recognition:
  • Feature transformation
  • DNN acoustic models
  • (DNN-based) model adaptation
⇒ DNNs are very powerful at handling spectral information through deep non-linear processing
• But the performance is not yet sufficient
⇒ Multi-microphone speech recognition


Content of the tutorial: Multi-microphone speech recognition

[Pipeline diagram: $Y_j(t, f)$ → SE front-end → $\hat{X}(t, f)$ → feature extraction → $\mathbf{o}_t$ → recognizer (acoustic model, lexicon, language model; model adaptation) → $W$]

• Multichannel speech enhancement front-end (Section 2): dereverberation, noise reduction
• Joint training paradigm (Section 3): unify speech enhancement and speech recognition


Topics not covered in this tutorial

• Voice activity detection

• Keyword spotting

• Multi-speaker / Speaker diarization

• Online processing

• Data simulation

References (Introduction)

(Barker'13) Barker, J. et al., "The PASCAL CHiME speech separation and recognition challenge," Computer Speech & Language (2013).
(Barker'15) Barker, J. et al., "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," Proc. ASRU (2015).
(Carletta'05) Carletta, J. et al., "The AMI meeting corpus: A pre-announcement," Springer (2005).
(Chen'15) Chen, Z. et al., "Integration of speech enhancement and recognition using long short-term memory recurrent neural network," Proc. Interspeech (2015).
(Chunyang'15) Chunyang, W. et al., "Multi-basis adaptive neural network for rapid adaptation in speech recognition," Proc. ICASSP (2015).
(Delcroix'13) Delcroix, M. et al., "Is speech enhancement pre-processing still relevant when using deep neural networks for acoustic modeling?" Proc. Interspeech (2013).
(Delcroix'15a) Delcroix, M. et al., "Strategies for distant speech recognition in reverberant environments," Proc. CSL (2015).
(Delcroix'15b) Delcroix, M. et al., "Context adaptive deep neural networks for fast acoustic model adaptation," Proc. ICASSP (2015).
(Delcroix'16a) Delcroix, M. et al., "Context adaptive deep neural networks for fast acoustic model adaptation in noise conditions," Proc. ICASSP (2016).
(Delcroix'16b) Delcroix, M. et al., "Context adaptive neural network for rapid adaptation of deep CNN based acoustic models," Proc. Interspeech (2016).
(ETSI'07) Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-end Feature Extraction Algorithm; Compression Algorithms, ETSI ES 202 050 Ver. 1.1.5 (2007).
(Gemello'06) Gemello, R. et al., "Adaptation of hybrid ANN/HMM models using linear hidden transformations and conservative training," Proc. ICASSP (2006).
(Giri'15) Giri, R. et al., "Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning," Proc. ICASSP (2015).
(Harper'15) Harper, M., "The Automatic Speech recognition In Reverberant Environments (ASpIRE) challenge," Proc. ASRU (2015).
(Hori'15) Hori, T. et al., "The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition," Proc. ASRU (2015).

(Hinton'12) Hinton, G. et al., "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine 29, 82–97 (2012).
(Juang'04) Juang, B. H. et al., "Automatic speech recognition – A brief history of the technology development," (2004).
(Kim'12) Kim, C. et al., "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," Proc. ICASSP (2012).
(Kinoshita'13) Kinoshita, K. et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," Proc. WASPAA (2013).
(Kuttruff'09) Kuttruff, H., "Room Acoustics," 5th ed., Taylor & Francis (2009).
(Liao'13) Liao, H., "Speaker adaptation of context dependent deep neural networks," Proc. ICASSP (2013).
(Lincoln'05) Lincoln, M. et al., "The multichannel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments," Proc. ASRU (2005).
(Matassoni'14) Matassoni, M. et al., "The DIRHA-GRID corpus: Baseline and tools for multi-room distant speech recognition using distributed microphones," Proc. Interspeech (2014).
(Mitra'14) Mitra, V. et al., "Damped oscillator cepstral coefficients for robust speech recognition," Proc. Interspeech (2013).
(Mohamed'12) Mohamed, A. et al., "Acoustic modeling using deep belief networks," IEEE Trans. ASLP (2012).
(Neto'95) Neto, J. et al., "Speaker adaptation for hybrid HMM-ANN continuous speech recognition system," Proc. Interspeech (1995).
(Ochiai'14) Ochiai, T. et al., "Speaker adaptive training using deep neural networks," Proc. ICASSP (2014).
(Pallett'03) Pallett, D. S., "A look at NIST's benchmark ASR tests: Past, present, and future," Proc. ASRU (2003).
(Parihar'02) Parihar, N. et al., "DSR front-end large vocabulary continuous speech recognition evaluation," (2002).
(Peddinti'15) Peddinti, V. et al., "A time delay neural network architecture for efficient modeling of long temporal contexts," Proc. Interspeech (2015).
(Saon'13) Saon, G. et al., "Speaker adaptation of neural network acoustic models using i-vectors," Proc. ASRU (2013).

(Saon'15) Saon, G. et al., "The IBM 2015 English conversational telephone speech recognition system," arXiv:1505.05899 (2015).
(Seltzer'13) Seltzer, M. L. et al., "An investigation of deep neural networks for noise robust speech recognition," Proc. ICASSP (2013).
(Seltzer'14) Seltzer, M., "Robustness is dead! Long live robustness!" Proc. REVERB (2014).
(Schluter'07) Schluter, R. et al., "Gammatone features and feature combination for large vocabulary speech recognition," Proc. ICASSP (2007).
(Swietojanski'14b) Swietojanski, P. et al., "Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models," Proc. SLT (2014).
(Tachioka'13) Tachioka, Y. et al., "Discriminative methods for noise robust speech recognition: A CHiME challenge benchmark," Proc. CHiME (2013).
(Tachioka'14) Tachioka, Y. et al., "Dual system combination approach for various reverberant environments with dereverberation techniques," Proc. REVERB Workshop (2014).
(Tan'15) Tan, T. et al., "Cluster adaptive training for deep neural network," Proc. ICASSP (2015).
(Waibel'89) Waibel, A. et al., "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech, and Signal Processing (1989).
(Weng'14) Weng, C. et al., "Recurrent deep neural networks for robust speech recognition," Proc. ICASSP (2014).
(Vincent'13) Vincent, E. et al., "The second 'CHiME' speech separation and recognition challenge: Datasets, tasks and baselines," Proc. ICASSP (2013).
(Vincent'16) Vincent, E. et al., "An analysis of environment, microphone and data simulation mismatches in robust speech recognition," Computer Speech & Language, to appear.
(Xiong'16) Xiong, W. et al., "The Microsoft 2016 conversational speech recognition system," arXiv:1609.03528 (2016).
(Yoshioka'15a) Yoshioka, T. et al., "Far-field speech recognition using CNN-DNN-HMM with convolution in time," Proc. ICASSP (2015).
(Yoshioka'15b) Yoshioka, T. et al., "The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," Proc. ASRU (2015).
(Yu'13) Yu, D. et al., "KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition," Proc. ICASSP (2013).


Challenges of DSR

• Reverberation
• Background noise
• Interfering speaker
• Distant microphone

[Figure: illustration of the distant-microphone scenario with these four challenges]


Signal model – Time domain

• Speech captured with a distant microphone array

• Microphone signal at the j-th microphone:

 y_j[n] = Σ_l h_j[l] x[n−l] + u_j[n] = h_j[n] ∗ x[n] + u_j[n]

• x[n]: target clean speech
• h_j[n]: room impulse response
• u_j[n]: additive noise (background noise, …)
• n: time index

[Figure: x[n] passes through h_j[n]; u_j[n] is added to give y_j[n]]
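To make the model concrete, here is a minimal numpy sketch of this time-domain signal model; the impulse responses, noise level, and source signal are random stand-ins chosen for illustration, not data from any corpus.

import numpy as np

rng = np.random.default_rng(0)
fs = 16000
x = rng.standard_normal(fs)              # stand-in for one second of clean speech x[n]
J = 2                                    # number of microphones
L = int(0.3 * fs)                        # 300 ms impulse response length

y = []
for j in range(J):
    # Exponentially decaying random impulse response as a crude room model
    h_j = rng.standard_normal(L) * np.exp(-np.arange(L) / (0.05 * fs))
    u_j = 0.01 * rng.standard_normal(len(x) + L - 1)   # additive noise u_j[n]
    # y_j[n] = (h_j * x)[n] + u_j[n]
    y.append(np.convolve(h_j, x) + u_j)
y = np.stack(y)                          # J x N matrix of microphone signals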


Signal model – STFT domain

• Speech captured with a distant microphone array

• Microphone signal at the j-th microphone:

 Y_j(t,f) ≈ Σ_m H_j(m,f) X(t−m,f) + U_j(t,f) = H_j(t,f) ∗ X(t,f) + U_j(t,f)

• X(t,f): target clean speech
• H_j(t,f): room impulse response
• U_j(t,f): additive noise
• (t,f): time frame index and frequency bin index

A long-term convolution in the time domain is approximated as a convolution across frames in the STFT domain, because h_j[n] is longer than the STFT analysis window.

[Figure: X(t,f) passes through H_j(t,f); U_j(t,f) is added to give Y_j(t,f)]


SE front-end

[System diagram: microphone signals Y_j(t,f) → SE front-end (multi-channel dereverberation → multi-channel noise reduction → single-channel noise reduction) → X̂(t,f) → feature extraction → o_t → recognizer (acoustic model, lexicon, language model; model adaptation) → W: “My name is …”]


Speech enhancement (SE)

• Reduce the mismatch between the observed speech and the ASR back-end caused by noise and reverberation

• Single-channel

• Multi-channel (multiple microphones)


Type of processing

• Linear processing: a linear filter kept constant over long segments, e.g. X̂(t,f) = w_f^H y_{t,f}

• Non-linear processing:
 • a linear filter changing at each time frame, e.g. X̂(t,f) = w_{t,f}^H y_{t,f}
 • or a non-linear transformation X̂(t,f) = F(y_{t,f}), with F(∙) a non-linear function


Categorization of SE front-ends

Linear processing
• Single-channel: WPE dereverberation (Nakatani’10)
• Multi-channel: beamforming (Van Trees’02); WPE dereverberation (Nakatani’10); neural network-based enhancement (Heymann’15)

Non-linear processing
• Single-channel: spectral subtraction (Boll’79); Wiener filter (Lim’79); time-frequency masking (Wang’06); NMF (Virtanen’07); neural network-based enhancement (Xu’15, Narayanan’13, Weninger’15)
• Multi-channel: time-frequency masking (Sawada’04); NMF (Ozerov’10); neural network-based enhancement (Xiao’16)



Focus of this section:
• Multi-channel (multiple microphones)
• Linear processing
• Neural network-based enhancement

These have been shown to interconnect well with the ASR back-end.


Single-channel speech enhancement

• Classical single-channel approaches: spectral subtraction (or Wiener filtering) (Boll’79, Lim’79)

 |X̂(t,f)|² = |Y(t,f)|² − |Û(t,f)|²

 • Difficult to estimate the noise term |Û(t,f)|² in non-stationary environments
 • Noise reduction comes at the cost of introducing distortions
 • Not targeted for ASR

• Vector Taylor Series (VTS)
 • Estimates noise in the feature domain
 • Targeted for ASR (GMM-HMM)

→ Limited effect with DNN back-ends (Seltzer’13, Li’13): ~0% relative improvement with VTS or T/F masking

• Neural network-based enhancement
 • Can learn to reduce non-stationary noise
 • Being a neural network, it can potentially be merged into the DNN acoustic model → see Section 3.
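As an illustration of the classical approach, here is a minimal numpy sketch of power spectral subtraction; it assumes the noise power can be estimated from the first few (speech-free) frames, and the names (Y, noise_frames) are illustrative rather than from any toolkit.

import numpy as np

def spectral_subtraction(Y, noise_frames=10, floor=1e-3):
    """Y: complex STFT, shape (T, F). Returns an enhanced complex STFT estimate."""
    power = np.abs(Y) ** 2
    noise_power = power[:noise_frames].mean(axis=0)   # estimate of |U(t,f)|^2 per bin
    # Subtract the noise power, flooring to avoid negative (or zero) values
    enhanced_power = np.maximum(power - noise_power, floor * power)
    # Keep the noisy phase: only the magnitude is enhanced
    return np.sqrt(enhanced_power) * np.exp(1j * np.angle(Y))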


2.1 Dereverberation

[System diagram: Y_j(t,f) → SE front-end (multi-channel dereverberation → multi-channel noise reduction → single-channel noise reduction) → X̂(t,f) → feature extraction → o_t → recognizer (acoustic model, lexicon, language model; model adaptation) → W: “My name is …”]


Room impulse response

• Models the multi-path propagation of sound caused by reflections on walls and objects (Kuttruff’09)
• Length: 200–1000 ms in typical living rooms
• Assumed to be time-invariant (no change in room or microphone/speaker position)

[Figure: room impulse response h_j[n] over time — direct path, then early reflections (duration: 30–50 ms), then late reverberation (duration: 100–1000 ms)]


Reverberant speech

 Y(t,f) = H(t,f) ∗ X(t,f) + U(t,f)
  = Σ_{τ=0}^{d} H(τ,f) X(t−τ,f) + Σ_{τ=d+1}^{T} H(τ,f) X(t−τ,f) + U(t,f)

where the first term, D(t,f), contains the direct path and early sound reflections, and the second term, L(t,f), is the late reverberation.

Dereverberation aims at suppressing the late reverberation (Yoshioka’12b).

Noise is neglected in the following derivations.


Dereverberation

• Linear filtering
 • Weighted prediction error

• Non-linear filtering
 • Spectral subtraction using a statistical model of late reverberation (Lebart’01, Tachioka’14)
 • Neural network-based dereverberation (Weninger’14)


Linear prediction (LP)

• Reverberation is generated by a linear filter → reverberation can be predicted from past observations using linear prediction (under some conditions) (Haykin’96)

Prediction of the current signal from the past signals:

 Y(t,f) = D(t,f) + Σ_{τ=1} G*(τ,f) Y(t−τ,f)

Dereverberation:

 D̂(t,f) = Y(t,f) − Σ_τ G*(τ,f) Y(t−τ,f)

[Figure: past STFT frames are used to predict the current frame]

Caveat: the predictable part includes the speech correlation, so D(t,f) and L(t,f) are both reduced.


Estimation of prediction coefficients

• Model the prediction residual (speech STFT coefficients) as a stationary complex Gaussian:

 p(D(t,f)) = N_c(D(t,f); 0, φ_D) = (1/(π φ_D)) exp(−|D(t,f)|² / φ_D)

 where φ_D is the signal power (assumed to be constant).

• Probability of the observed signal given the past observations (likelihood):

 p(Y(t,f) | Y(t−1,f), …, Y(t−T,f); G(τ,f)) = (1/(π φ_D)) exp(−|Y(t,f) − Σ_{τ=1} G*(τ,f) Y(t−τ,f)|² / φ_D)

• The prediction filters can be obtained by maximizing the log-likelihood:

 L(G(τ,f)) = Σ_t log p(Y(t,f) | Y(t−1,f), …, Y(t−T,f))
  = −Σ_t log(π φ_D) − Σ_t |Y(t,f) − Σ_{τ=1} G*(τ,f) Y(t−τ,f)|² / φ_D

 The first term is constant with respect to G(τ,f), so

 ⇒ Ĝ(τ,f) = argmin_{G(τ,f)} Σ_t |Y(t,f) − Σ_{τ=1} G*(τ,f) Y(t−τ,f)|²


Estimation of prediction coefficients

• The prediction filter is obtained as

 Ĝ(τ,f) = argmin_{G(τ,f)} Σ_t |Y(t,f) − Σ_{τ=1} G*(τ,f) Y(t−τ,f)|²

• This is a least-squares problem; stacking the past observations as ỹ_{t,f} = [Y(t−1,f), …, Y(t−T,f)]^T and the filter coefficients as g_f, the solution is given by the normal equations:

 ĝ_f = (Σ_t ỹ_{t,f} ỹ_{t,f}^H)^{−1} Σ_t ỹ_{t,f} Y*(t,f)


Problem of LP-based speech dereverberation

• LP predicts both early reflections and late reverberation
 • The speech signal exhibits short-term correlation (30–50 ms)
 → LP also suppresses the short-term correlation of speech

• LP assumes the target signal follows a stationary Gaussian distribution
 • Speech is not stationary Gaussian
 → LP destroys the time structure of speech

• Solutions:
 • Introduce a prediction delay (Kinoshita’07)
 • Introduce better modeling of speech signals (Nakatani’10, Yoshioka’12, Jukic’14)


Delayed linear prediction (LP) (Kinoshita’07)

• Introduce a prediction delay d (corresponding to 30–50 ms), so that only signals older than d frames are used for prediction:

 Y(t,f) = D(t,f) + Σ_{τ=d} G*(τ,f) Y(t−τ,f)

• Delayed LP can only predict L(t,f) from the past signals; D(t,f) is unpredictable
 → only L(t,f) is reduced

[Figure: past STFT frames beyond the delay d are used to predict the late reverberation of the current frame]


Estimation of prediction coefficients

• Delayed LP:

 D̂(t,f) = Y(t,f) − Σ_{τ=d} G*(τ,f) Y(t−τ,f)

• For a non-stationary signal with time-varying power φ_D(t,f):

 p(D(t,f)) = N_c(D(t,f); 0, φ_D(t,f)) = (1/(π φ_D(t,f))) exp(−|D(t,f)|² / φ_D(t,f))

 where φ_D(t,f) is the time-varying signal power.

• Prediction filters:

 Ĝ(τ,f) = argmin_{G(τ,f)} Σ_t |Y(t,f) − Σ_{τ=d} G*(τ,f) Y(t−τ,f)|² / φ_D(t,f)

 φ̂_D(t,f) = argmin_{φ_D(t,f)} [ log(π φ_D(t,f)) + |Y(t,f) − Σ_{τ=d} G*(τ,f) Y(t−τ,f)|² / φ_D(t,f) ]

→ Weighted prediction error (WPE) (Nakatani’10, Yoshioka’12)


Multi-channel extension

• Exploit the past signals from all microphones to predict the current signal at a (reference) microphone:

 D̂(t,f) = Y_1(t,f) − Σ_{j=1}^{J} Σ_{τ=d} G_j*(τ,f) Y_j(t−τ,f) = Y_1(t,f) − g_f^H y_{t−d,f}

• The prediction filter is obtained as

 ĝ_f = argmin_{g_f} Σ_t |Y_1(t,f) − g_f^H y_{t−d,f}|² / φ_D(t,f)

• Can output multi-channel signals (by repeating the procedure for each reference microphone)
• Preserves early reflections → can be used as a front-end to a beamformer


Processing flow of WPE

[Flow diagram: power estimation → prediction filter estimation → dereverberation, iterated]
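A minimal single-channel sketch of this iterative flow for one frequency bin is given below; the delay d, filter length K, and iteration count are illustrative assumptions, and the code is only meant to show the alternating updates, not to reproduce the NTT p-code implementation.

import numpy as np

def wpe_1ch(Y, d=3, K=10, iters=3, eps=1e-8):
    """Y: complex STFT of one frequency bin, shape (T,). Returns dereverberated D."""
    T = len(Y)
    D = Y.copy()
    for _ in range(iters):
        # Power estimation: time-varying power phi_D(t) from the current estimate
        phi = np.maximum(np.abs(D) ** 2, eps)
        # Build the delayed observation matrix: row t holds [Y(t-d), ..., Y(t-d-K+1)]
        Ybar = np.zeros((T, K), dtype=complex)
        for k in range(K):
            Ybar[d + k:, k] = Y[:T - d - k]
        # Prediction filter estimation: weighted least squares (normal equations)
        W = Ybar / phi[:, None]
        R = W.conj().T @ Ybar            # weighted correlation matrix
        r = W.conj().T @ Y               # weighted correlation vector
        g = np.linalg.solve(R + eps * np.eye(K), r)
        # Dereverberation: subtract the predicted late reverberation
        D = Y - Ybar @ g
    return D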


Sound demo from the REVERB challenge (Delcroix’14)

[Spectrograms (frequency 0–8000 Hz, time 0–0.8 s): headset, distant (RealData), WPE (8 ch), WPE + beamformer (8 ch)]


Results for REVERB and CHiME 3

Front-end | REVERB (8 ch) | CHiME 3 (6 ch)
None | 19.2 % | 15.6 %
WPE | 12.9 % | 14.7 %
WPE + MVDR beamformer | 9.3 % | 7.6 %

Results for the REVERB task (Real Data, eval set) (Delcroix’15): DNN-based acoustic model trained with augmented training data; environment adaptation; decoding with RNN-LM.

Results for the CHiME 3 task (Real Data, eval set) (Yoshioka’15): deep CNN-based acoustic model trained with 6-channel training data; no speaker adaptation; decoding with RNN-LM.


Remarks

• Precise speech dereverberation with linear processing
• Can be shown to cause no distortion to the target speech
 → particularly efficient as an ASR front-end
• Can output multi-channel signals
 → suited as beamformer pre-processing
• Relatively robust to noise
• Efficient implementation in the STFT domain
• A few seconds of observation are sufficient to estimate the prediction filters

Matlab p-code available at: www.kecl.ntt.co.jp/icl/signal/wpe


2.2 Beamforming

[System diagram: Y_j(t,f) → SE front-end (multi-channel dereverberation → multi-channel noise reduction → single-channel noise reduction) → X̂(t,f) → feature extraction → o_t → recognizer (acoustic model, lexicon, language model; model adaptation) → W: “My name is …”]


Principle

• Pick up signals in the direction of the target speaker
• Attenuate signals in the directions of the noise sources

Beam pattern: the microphone array gain as a function of the direction of arrival of the signal.


Beampattern

• The beampattern shows the response/gain of the beamformer as a function of the angle for a given frequency
• Provides a way to check the design of a beamformer

• Main lobe: directed towards the speaker → captures the target signal
• Null: in the direction of a point noise source → cancels the point noise source
• Sidelobes: inevitable when designing a beamformer → cause residual noise

[Figure: delay-and-sum beampattern of a circular array (8 elements, diameter = 20 cm) pointing at 120 degrees, for frequency 2500 Hz]


Microphone signal model

• Consider room impulse responses only within the STFT analysis window
• Late reverberation is treated as diffuse noise and included in the noise term

 Y_j(t,f) ≈ Σ_m H_j(m,f) X(t−m,f) + U_j(t,f) = H_j(f) X(t,f) + U_j(t,f)

 where O_j(t,f) ≜ H_j(f) X(t,f) is the source image at microphone j.

• Using vector notation:

 y_{t,f} = [Y_1(t,f), …, Y_J(t,f)]^T = h_f X(t,f) + u_{t,f}, with o_{t,f} ≜ h_f X(t,f)

• h_f = [H_1(f), …, H_J(f)]^T: steering vector
• o_{t,f}: source image at the microphones
• u_{t,f}: noise


Steering vector

• Represents the propagation from the source to the microphones, including:
 • propagation delays (information about the source direction)
 • early reflections (reverberation within the analysis window)
 • propagation attenuation (microphone gains)

• Example: plane-wave assumption with free-field condition (no reverberation and speaker far enough from the microphones):

 Y_j(t,f) = g_j(f) e^{−i2πf Δτ_j} X(t,f) + U_j(t,f)

 The steering vector is then given by:

 h_f = [g_1(f) e^{−i2πf Δτ_1}, …, g_J(f) e^{−i2πf Δτ_J}]^T

• Δτ_j: time difference of arrival (TDOA) with respect to the reference microphone, frequency independent
• g_j(f): signal gain, frequency dependent
• U_j(t,f): reverberation and noise


Beamformer

• Output of the beamformer:

 X̂(t,f) = Σ_j W_j*(f) Y_j(t,f)

• Vector notation:

 X̂(t,f) = w_f^H y_{t,f}

 w_f = [W_1(f), …, W_J(f)]^T
 y_{t,f} = [Y_1(t,f), …, Y_J(t,f)]^T

• The filters w_f are designed to remove noise

[Figure: each microphone signal Y_j(t,f) is filtered by W_j*(f) and the outputs are summed to give X̂(t,f)]


Processing flow

1. Spatial information extraction: TDOAs, steering vectors, spatial correlation matrices
2. Filter computation: delay-and-sum (DS), minimum variance distortionless response (MVDR)
3. Filtering: each channel is filtered by W_j*(f) and the outputs are summed to give X̂(t,f)


2.2.1 Delay and Sum beamformer


Delay and sum (DS) beamformer (Van Veen’88)

• Align the microphone signals in time
 • Emphasize signals coming from the target direction
 • Destructive summation of signals coming from the other directions
• Requires estimation of the TDOAs Δτ_j

 w_f = (1/J) [e^{i2πf Δτ_1}, …, e^{i2πf Δτ_J}]^T

[Figure: a wavefront reaches the microphones with delays Δτ_j; each channel is phase-shifted by e^{i2πf Δτ_j}/J and summed to give X̂(t,f)]
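A minimal numpy sketch of STFT-domain delay-and-sum is shown below; it assumes the TDOAs (in seconds) are already available, e.g., from the GCC-PHAT estimation described next, and the phase sign convention depends on how the TDOAs are defined.

import numpy as np

def delay_and_sum(Y, tdoas, freqs):
    """Y: complex STFT, shape (J, T, F); tdoas: (J,) in seconds; freqs: (F,) in Hz."""
    J = Y.shape[0]
    # w_f = (1/J) exp(i 2*pi*f*dtau_j): phase-align the channels before summing
    W = np.exp(2j * np.pi * freqs[None, :] * tdoas[:, None]) / J   # shape (J, F)
    # X_hat(t, f) = sum_j W_j(f) Y_j(t, f)
    return np.einsum('jf,jtf->tf', W, Y)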


TDOA estimation (Knapp’76, Brutti’08)

• The signal cross-correlation peaks when the signals are aligned in time:

 Δτ_ij = argmax_τ ψ_{y_i y_j}(τ), with ψ_{y_i y_j}(τ) = E[y_i(t) y_j(t+τ)]

• The cross-correlation is sensitive to noise and reverberation
• Usually use GCC-PHAT* coefficients, which are more robust to reverberation:

 ψ^PHAT_{y_i y_j}(τ) = IFFT[ Y_i(f) Y_j*(f) / |Y_i(f) Y_j*(f)| ]

[Figure: plain cross-correlation vs. GCC-PHAT correlation functions over τ, each peaking at Δτ_ij]

* Generalized Cross Correlation with Phase Transform (GCC-PHAT)
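A minimal numpy sketch of GCC-PHAT-based TDOA estimation for one microphone pair; the function name and arguments are illustrative, and a small constant guards against division by zero.

import numpy as np

def gcc_phat_tdoa(yi, yj, fs, max_delay=None):
    """yi, yj: time-domain signals; fs: sampling rate. Returns the TDOA in seconds."""
    n = len(yi) + len(yj)
    Yi = np.fft.rfft(yi, n=n)
    Yj = np.fft.rfft(yj, n=n)
    cross = Yi * np.conj(Yj)
    # PHAT weighting: keep only the phase of the cross-spectrum
    psi = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)
    max_shift = max_delay if max_delay is not None else n // 2
    # Rearrange so that lag 0 sits in the middle, then pick the peak
    psi = np.concatenate((psi[-max_shift:], psi[:max_shift + 1]))
    delay = np.argmax(np.abs(psi)) - max_shift
    return delay / fs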


BeamformIt – a robust implementation of a weighted DS beamformer* (Anguera’07)

• Used in the baseline systems of several tasks: AMI, CHiME 3/4

Processing flow:
1. GCC-PHAT coefficient computation
2. Filtering of noisy TDOAs
3. TDOA tracking with Viterbi decoding → Δτ_1, …, Δτ_J
4. Reference channel selection
5. Channel weight computation → α_1, …, α_J
6. Filter computation: w_f = [(α_1/J) e^{i2πf Δτ_1}, …, (α_J/J) e^{i2πf Δτ_J}]^T
7. Beamforming, followed by single-channel noise reduction

Toolkit available: www.xavieranguera.com/beamformit

* Also sometimes called a filter-and-sum beamformer


2.2.2 MVDR beamformer


Minimum variance distortionless response (MVDR*) beamformer

• Beamformer output:

 X̂(t,f) = w_f^H y_{t,f} = w_f^H h_f X(t,f) + w_f^H u_{t,f}

 ⇒ with the distortionless constraint w_f^H h_f = 1: X̂(t,f) = X(t,f) + w_f^H u_{t,f}

• The filter is obtained by minimizing the noise at the output of the beamformer while keeping the speech X(t,f) unchanged (distortionless):

 w_f^MVDR = argmin_{w_f} E[|w_f^H u_{t,f}|²] subject to w_f^H h_f = 1

* The MVDR beamformer is a special case of the more general linearly constrained minimum variance (LCMV) beamformer (Van Veen’88)


Expression of the MVDR filter

• The MVDR filter is given by

 w_f^MVDR = (R_f^noise)^{−1} h_f / (h_f^H (R_f^noise)^{−1} h_f)

• where R_f^noise is the spatial correlation matrix* of the noise, which measures the correlation among the noise signals at the different microphones:

 R_f^noise = (1/T) Σ_{t=1}^{T} u_{t,f} u_{t,f}^H

 i.e., a J×J matrix whose (i,j) entry is (1/T) Σ_{t=1}^{T} U_i(t,f) U_j*(t,f).

* The spatial correlation matrix is also called cross spectral density
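A minimal numpy sketch of the MVDR filter computation from these formulas, for one frequency bin; the diagonal loading term is a common practical safeguard and an assumption of this sketch, not part of the formula above.

import numpy as np

def mvdr_filter(R_noise, h, diag_load=1e-6):
    """R_noise: (J, J) noise spatial correlation matrix; h: (J,) steering vector."""
    J = len(h)
    # Solve R_noise^{-1} h without forming the inverse explicitly
    Rinv_h = np.linalg.solve(R_noise + diag_load * np.eye(J), h)
    # w = R^{-1} h / (h^H R^{-1} h)
    return Rinv_h / (h.conj() @ Rinv_h)

def apply_beamformer(w, Y):
    """w: (J,) filter; Y: (J, T) STFT at one frequency. Returns X_hat(t) = w^H y_t."""
    return w.conj() @ Y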


Alternate expressions

Other expressions for the MVDR filter can be derived:

1. As a function of the microphone signal correlation matrix:

 w_f^MVDR = (R_f^obs)^{−1} h_f / (h_f^H (R_f^obs)^{−1} h_f)

2. Without explicit steering vector estimation (Souden’10):

 w_f^MVDR = [ (R_f^noise)^{−1} R_f^speech / trace((R_f^noise)^{−1} R_f^speech) ] u

 • u = [0, …, 0, 1, 0, …, 0]^T is a vector with a 1 at the reference microphone
 • As it does not require steering vector estimation, it may be easier to use for joint optimization


Steering vector estimation (Souden’13, Higuchi’16, Yoshioka’15, Heymann’15)

The steering vector h_f can be obtained as the principal eigenvector of the spatial correlation matrix of the source image signals, R_f^speech:

 h_f = P(R_f^speech)

with the source image spatial correlation matrix estimated as

 R_f^speech = R_f^obs − R_f^noise

 R_f^obs = Σ_t y_{t,f} y_{t,f}^H (microphone signals: speech + noise)

 R_f^noise = Σ_t M(t,f) y_{t,f} y_{t,f}^H / Σ_t M(t,f) (noise estimate)

where M(t,f) is a spectral mask:

 M(t,f) = 1 if noise > speech in bin (t,f), 0 otherwise
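A minimal numpy sketch of the mask-based statistics and steering-vector estimation above, for one frequency bin; the mask M is assumed to be given (e.g., from the clustering or neural approaches on the next slide).

import numpy as np

def estimate_steering_vector(Y, M):
    """Y: (J, T) multi-channel STFT at one frequency; M: (T,) noise mask in [0, 1]."""
    R_obs = (Y @ Y.conj().T) / Y.shape[1]                        # observed statistics
    # Mask-weighted noise statistics: sum_t M(t) y_t y_t^H / sum_t M(t)
    R_noise = (Y * M) @ Y.conj().T / np.maximum(M.sum(), 1e-8)
    R_speech = R_obs - R_noise
    # Principal eigenvector of the (Hermitian) speech correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R_speech)
    h = eigvecs[:, -1]                                           # largest eigenvalue last
    return h, R_noise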


Spectral mask estimation

• Clustering of spatial features for mask estimation, with source models:
 • Watson mixture model (Souden’13)
 • Complex Gaussian mixture model (Higuchi’16)

 EM-style iteration:
 • E-step (update masks): M(t,f) = p(noise | y_{t,f}, R_f^noise, R_f^speech)
 • M-step (update spatial correlation matrices): R_f^noise = Σ_t M(t,f) y_{t,f} y_{t,f}^H / Σ_t M(t,f)

• Neural network-based approaches (Hori’15, Heymann’15) → see Section 3.


Processing flow of the MVDR beamformer

1. Time-frequency mask estimation: M(t,f) = 1 if |u_{t,f}| > |o_{t,f}|, 0 otherwise
2. Spatial correlation matrix estimation: R_f^speech = R_f^obs − R_f^noise
3. Steering vector estimation: h_f = P(R_f^speech)
4. Filter estimation: w_f^MVDR = (R_f^noise)^{−1} h_f / (h_f^H (R_f^noise)^{−1} h_f)
5. Beamforming: X̂(t,f) = w_f^H y_{t,f}


Other beamformers

• Max-SNR beamformer* (Van Veen’88, Araki’07, Warsitz’07)
 • Optimizes the output SNR without the distortionless constraint:

  w_f^maxSNR = P((R_f^noise)^{−1} R_f^obs)

• Multi-channel Wiener filter (MCWF) (Doclo’02)
 • Preserves spatial information at the output (multi-channel output):

  W_f^MCWF = (R_f^obs)^{−1} R_f^speech

 → Max-SNR beamformer and MCWF can also be derived from the spatial correlation matrices

• Generalized sidelobe canceller (GSC)
 • Linearly constrained minimum variance (LCMV) beamformers such as MVDR can be expressed as a GSC
 [Diagram: distortionless filter in the upper path; blocking matrix and noise estimation in the lower path; the noise estimate is subtracted from the upper-path output]

* The Max-SNR beamformer is also called the generalized eigenvalue beamformer


2.2.3 Experiments


CHiME 3 results

Results for the CHiME 3 task (Real Data, eval set): deep CNN-based acoustic model trained with 6-channel training data; no speaker adaptation; decoding with RNN-LM.

[Bar chart: WER (%) for no processing, TF masking, BeamformIt, Max-SNR, MCWF, and MVDR front-ends]


Sound demo

[Spectrograms (frequency 0–8000 Hz, time 0–6 s): clean, observed (SimuData), MVDR, time-frequency masking (MASK)]


Remarks

• Delay-and-sum beamformer
 + Simple approach
 − Relies on correct TDOA estimation: errors in TDOA estimation may result in amplifying the noise
 → Not optimal for noise reduction in general

• Weighted DS beamformer (BeamformIt)
 + Includes weights to compensate for amplitude differences among the microphone signals
 + Uses a more robust TDOA estimation than plain GCC-PHAT
 − Still potentially affected by noise and reverberation
 → Not optimal for noise reduction

• MVDR beamformer
 + Optimized for noise reduction while preserving speech (distortionless)
 • Extracting spatial information is a key to success:
  − from TDOAs: poor performance with noise and reverberation
  + from signal statistics: more robust to noise and reverberation
 − More involved computationally than the DS beamformer


Conclusions for section 2

• Multi-microphone speech enhancement can greatly improve ASR performance
 • Dereverberation
 • Beamforming

• Distortionless processing is important
 • Beamforming outperforms TF masking for ASR
 • TF masking removes more noise, but linear filtering causes less distortion (especially with the distortionless constraint)
 → This leads to better ASR performance

• Future directions
 • Online extension (source tracking)
 • Multiple speakers
 • Tighter interconnection of the SE front-end and the ASR back-end (see Section 3)


References (SE-Front-end 1/4)

(Anguera’07) Anguera, X., et al. “Acoustic beamforming for speaker diarization of meetings,” IEEE Trans. ASLP (2007).

(Araki’07) Araki, S., et al. “Blind speech separation in a meeting situation with maximum SNR beamformers,” Proc. ICASSP (2007).

(Boll’79) Boll, S. “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. ASSP (1979).

(Brutti’08) Brutti, A., et al. “Comparison between different sound source localization techniques based on a real data collection,” Proc. HSCMA (2008).

(Delcroix’14) Delcroix, M., et al. “Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge,” Proc. REVERB (2014).

(Delcroix’15) Delcroix, M., et al. “Strategies for distant speech recognition in reverberant environments,” EURASIP Journal ASP (2015).

(Doclo’02) Doclo, S., et al. “GSVD-based optimal filtering for single and multi-microphone speech enhancement,”. IEEE Trans. SP (2002).

(Erdogan’15) Erdogan, H., et al. “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” Proc. ICASSP (2015).

(Haykin’96) Haykin, S. “Adaptive Filter Theory (3rd Ed.),” Prentice-Hall, Inc., Upper Saddle River, NJ, USA (1996).

(Heymann’15) Heymann, J., et al. “BLSTM supported GEV Beamformer front-end for the 3rd CHiME challenge,” Proc. ASRU (2015).

(Heymann’16) Heymann, J., et al. “Neural network based spectral mask estimation for acoustic beamforming,” Proc. ICASSP (2016).

(Higuchi’16) Higuchi, T., et al. “Robust MVDR beamforming using time frequency masks for online/offline ASR in noise,” Proc. ICASSP (2016).


References (SE-Front-end 2/4)

(Hori’15) Hori, T., et al. “The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition,” Proc. of ASRU (2015).

(Jukic’14) Jukic, A., et al. “Speech dereverberation using weighted prediction error with Laplacian model of the desired signal,” Proc. ICASSP (2014).

(Kinoshita’07) Kinoshita, K., et al., “Multi-step linear prediction based speech dereverberation in noisy reverberant environment,” Proc. Interspeech (2007).

(Knapp’76) Knapp, C.H., et al. “The generalized correlation method for estimation of time delay,” IEEE Trans. ASSP (1976).

(Lebart’01) Lebart, K., et al. “A new method based on spectral subtraction for speech dereverberation,” Acta Acoust (2001).

(Li’13) Li, B., and Sim, K. C. “Improving robustness of deep neural networks via spectral masking for automatic speech recognition,” Proc. ASRU (2013).

(Lim’79) Lim, J.S., et al. “Enhancement and bandwidth compression of noisy speech,” Proc. IEEE (1979).

(Mandel’10) Mandel, M.I., et al. “Model-based expectation maximization source separation and localization,” IEEE Trans. ASLP (2010).

(Nakatani’10) Nakatani, T., et al. “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Trans. ASLP (2010).

(Narayanan’13) Narayanan, A., et al. “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” Proc. ICASSP (2013).

(Ozerov’10) Ozerov, A., et al. “Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation,” in IEEE Trans. ASLP (2010).

(Sawada’04) Sawada, H., et al. “A robust and precise method for solving the permutation problem of frequency-domain blind source separation,” IEEE Trans. SAP (2004).


References (SE-Front-end 3/4)

(Seltzer’13) Seltzer, M. L., et al. “An investigation of deep neural networks for noise robust speech recognition,” Proc. ICASSP (2013).

(Souden’10) Souden, M., et al. “On optimal frequency-domain multichannel linear filtering for noise reduction,” IEEE Trans. ASLP (2010).

(Souden’13) Souden, M., et al. “A multichannel MMSE-based framework for speech source separation and noise reduction,” IEEE Trans. ASLP (2013).

(Tachioka’14) Tachioka, Y., et al. “Dual system combination approach for various reverberant environments with dereverberation techniques,” Proc. REVERB (2014).

(Van Trees’02) Van Trees, H.L. “Detection, estimation, and modulation theory. Part IV. , Optimum array processing,” Wiley-Interscience, New York (2002).

(Van Veen’88) Van Veen, B.D., et al. “Beamforming: A versatile approach to spatial filtering,” IEEE ASSP (1988).

(Virtanen’07) Virtanen, T. “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Trans. ASLP (2007).

(Wang’06) Wang, D. L., & Brown, G. J. (Eds.). “Computational auditory scene analysis: Principles, algorithms, and applications,” Hoboken, NJ: Wiley/IEEE Press (2006).

(Wang’16) Wang, Z.-Q., et al., “A Joint Training Framework for Robust automatic speech recognition,” IEEE/ACM Trans. ASLP (2016).

(Warsitz’07) Warsitz, E., et al. “Blind acoustic beamforming based on generalized eigenvalue decomposition,” IEEE Trans. ASLP (2007).

(Weninger’14) Weninger, F., et al. “The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement,” Proc. REVERB (2014).

(Weninger ’15) Weninger, F., et al. “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” Proc. LVA/ICA (2015).


References (SE-Front-end 4/4)

(Xiao’16) Xiao, X., et al. “Deep beamforming networks for multi-channel speech recognition,” Proc. ICASSP (2016).

(Xu’15) Xu, Y., et al. “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Trans. ASLP (2015).

(Yoshioka’12) Yoshioka, T., et al. “Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening,” IEEE Trans. ASLP (2012).

(Yoshioka’12b) Yoshioka, T., et al. “Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition,” IEEE Signal Process. Mag. (2012).

(Yoshioka’15) Yoshioka, T., et al. “The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices,” Proc. ASRU (2015).


Integration of front-end and back-end with deep network

• The techniques discussed so far are designed separately: SE front-end, feature extraction, and acoustic model.
• Next, we examine techniques that perform global optimization of these tasks.
• Towards end-to-end acoustic modeling: integrated SE, FE, and AM.

[System diagram: Y_j(t,f) → SE front-end → X̂(t,f) → feature extraction → o_t → recognizer (acoustic model, lexicon, language model; model adaptation) → W: “My name is …”]


Integrated beamforming and acoustic modeling

• We will focus on the integration of the beamforming front-end with acoustic modeling.

[System diagram as above, with the SE front-end expanded into multi-channel dereverberation → multi-channel noise reduction (beamforming) → single-channel noise reduction, followed by feature extraction and the acoustic model]


Review of the traditional processing pipeline of beamforming for ASR

[Pipeline: Y_j(t,f) → beamforming (delay-and-sum, weighted delay-and-sum, MVDR, …) → X̂(t,f) → feature extraction (filter banks, MFCCs, …) → o_t → acoustic-model neural nets (DNN, LSTM, CNN, …) → p(s_t = k | o_t) → to decoder]

Limitations:

• Not fine-tuned for ASR: optimized for a signal-level criterion (e.g., SNR) rather than an ASR criterion (e.g., WER).

• No learning capability: cannot leverage history data (experience) for better performance.


Towards end-to-end acoustic modeling

• All processing steps are implemented as neural nets
 → learn beamforming and feature extraction from data (prior knowledge may be incorporated in the model design)
 → global optimization for ASR

• Focus on neural-net implementations of beamforming
 → learning-based beamforming

[Pipeline: Y_j(t,f) → beamforming neural net (DNN, LSTM, CNN, …) → X̂(t,f) → feature extraction (filter banks, MFCCs, or DNN/LSTM/CNN) → o_t → acoustic-model neural net (DNN, LSTM, CNN, …) → p(s_t = k | o_t); the gradient flows backwards from the ASR cost through all blocks]


Taxonomy of Beamforming

• Generative: use a generative model to guide BF filter optimization.
• Discriminative: use the ASR cost to guide filter learning or prediction.

Beamforming
• Traditional: DS, MVDR, MCWF (Section 2.2)
• Learning based
 • Generative: maximum likelihood beamforming (Section 3.1)
 • Discriminative
  • BF filter learning: with phase information (Section 3.2); without phase information (Section 3.3)
  • BF filter prediction: direct filter prediction (Section 3.4); filter prediction with mask (Section 3.5)

• This section will cover the 5 learning-based leaf nodes in detail.

BF: beamforming


Use of prior knowledge in learning-based beamforming

[Spectrum from “more signal processing” to “more deep learning”: delay-and-sum, MVDR, etc. → ML beamforming, filter prediction by DNN → mask prediction by DNN → multi-channel acoustic model]

• Classical array processing theory provides valuable knowledge about how to design beamforming.
• Questions for model design:
 • Shall we use our knowledge of classic beamforming in learning-based methods?
 • How much knowledge, and how should it be incorporated into the model design?


3.1 Generative Approach – Maximum Likelihood Filter Estimation


Generative approach

• The observed features depend on both the steering vector and the state sequence.
• The task is to infer the steering vector (i.e., the beamforming filter) and the state sequence from the observed features.
• In the training phase, we learn the acoustic model p(O | S; θ) from single-channel clean speech.
 • O: the feature vector sequence, e.g., MFCCs or filterbanks
 • S: the state sequence
 • θ: the acoustic model, e.g., an HMM/GMM model
• In the test phase, we estimate S and w by maximizing the posterior probability p(w, S | O; θ).

[Graphical model: the steering vector w and the state sequence S both generate the observed features O]


Parameter Estimation (Seltzer’03/04)

• It is difficult to jointly estimate S and w from p(w, S | O; θ).
• We can estimate S and w in alternating order: fix w and estimate S; fix S and estimate w; … until convergence.

• Fix w to its current estimate ŵ and obtain S by maximizing p(S | ŵ, O; θ).
 • If the word label is available (supervised estimation): no need to decode O.
 • If the word label is not available (unsupervised estimation): equivalent to Viterbi decoding of the features generated from signals beamformed with the filter weights ŵ.

• Fix S to its current estimate Ŝ and obtain w by maximizing p(w | Ŝ, O; θ).
 • ML estimation, described on the next slide.


Maximum Likelihood Estimation (LIMABEAM) (Seltzer’03/04)

• Given S, the filter weights can be obtained with the maximum likelihood criterion, assuming a non-informative prior for w:

 ŵ = argmax_w log p(O | w, S; θ)

• The filter weights w can be in either the time domain or the frequency domain.
• The estimation problem can be solved iteratively with the expectation-maximization (EM) algorithm (see Section 1.2 for the EM algorithm):
 • E-step: update the state alignment and Gaussian posterior probabilities using the current w.
 • M-step: update the filter weights by gradient ascent using the current Gaussian posteriors.

[Computational graph of time-domain LIMABEAM: multi-channel speech signals → filter-and-sum in the time domain with tunable filter weights w, x̂(t) = Σ_j w_j(t) ∗ y_j(t) → single-channel enhanced waveform → short-time Fourier transform (STFT) → single-channel enhanced Fourier coefficients → MFCC feature extraction → negative log-likelihood on the HMM/GMM given the HMM state sequence; red arrows show the back-propagation of the gradient for tuning the filter weights]
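For concreteness, here is a minimal numpy sketch of the time-domain filter-and-sum step that LIMABEAM tunes; the likelihood computation and gradient update are omitted, and the initialization shown is a trivial pass-through (averaging) filter rather than the DS weights used in the paper.

import numpy as np

def filter_and_sum(y, w):
    """y: (J, N) multi-channel waveform; w: (J, K) per-channel FIR filters.
    Returns x_hat(t) = sum_j (w_j * y_j)(t)."""
    return sum(np.convolve(w[j], y[j], mode='full') for j in range(y.shape[0]))

# Trivial initialization: an impulse at tap 0 on each channel (20 taps per channel)
J, K = 4, 20
w0 = np.zeros((J, K))
w0[:, 0] = 1.0 / J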


Performance of LIMABEAM (Seltzer’03/04)

• CMU WSJ PDA corpus
 • Array with 4 distant mics (see figure); simultaneous close-talk recording
 • Room T60 = 0.27 s; SNR = 13 dB (PDA-B subset)
 • Speaker reads text from the test set of WSJ0
 • Variable position of the PDA from utterance to utterance

• ASR settings
 • HMM/GMM acoustic model trained on 7000 clean sentences, 39-D MFCC features (CMN)

• LIMABEAM settings
 • First-pass decoding on delay-and-sum (DS) features to obtain hypothesized text, which is used to obtain the HMM state alignment
 • LIMABEAM weights initialized as DS weights (20 taps in the time domain for each channel)

WER on the PDA-B subset (Seltzer’04):

Settings | WER
1 distant mic | 68.3%
Delay-and-sum | 58.9%
LIMABEAM (unsupervised) | 39.8%
Close talk | 27.5%


Frequency Domain Implementation for CHiME4 (Xiao’16b)

• The ML estimation is revised as follows:

 ŵ = argmax_w [ log p(O | w; θ) + (T/2) log|Σ_o| − (α/2) ||w − w_0||² ]

• The HMM/GMM is replaced by a GMM (1024 Gaussians) → no need to find S → avoids multi-pass decoding in unsupervised mode.
• The log-determinant of the feature covariance matrix Σ_o is used to account for the feature-space change due to beamforming.
• L2 regularization is used to avoid overfitting; w_0 denotes the initial weights.

[Computational graph: multi-channel speech signals → STFT → multi-channel Fourier coefficients → beamforming in the frequency domain, X̂(t,f) = w_f^H y_{t,f}, with tunable beamforming weights w → single-channel enhanced Fourier coefficients → feature extraction (MFCC or log Mel filterbanks) → cost = −log p(O | w; θ) − (T/2) log|Σ_o| + (α/2) ||w − w_0||², i.e., the negative log-likelihood on the HMM/GMM or UBM/GMM, the negative log-determinant of the covariance, and the L2 regularization (blue boxes are cost functions)]


Frequency Domain Implementation for CHiME4 (Xiao’16b)

• Two parameterizations of the beamforming filter weights:

 • Option 1: real + imaginary parts of the weights
  • w_j(f) = a_j(f) + i b_j(f). Number of free parameters = 2JF, where J is the number of channels and F is the number of frequency bins.

 • Option 2: TDOA + gain
  • w_f = [g_1(f) e^{−i2πf Δτ_1}, …, g_J(f) e^{−i2πf Δτ_J}]^T
  • Assume the TDOA is frequency independent to reduce the number of free parameters.
  • We may first estimate a frequency-independent gain (fewer parameters, more reliable estimation) and then use it to initialize the frequency-dependent gain (more parameters, with L2 regularization).

• Two feature spaces:
 • MFCC features (39-D) are used for GMM training (diagonal covariance matrices) and ML filter estimation.
 • Log Mel filterbank features (40-D) are used for decoding with the DNN-based acoustic model.


Feature Extraction Steps

• Mel filterbank: implemented as a linear transform, y_t = A x_t
 • A is a 40×257 matrix, where 40 is the number of Mel filterbanks and 257 is the number of frequency bins.
 • [Figure: the entries of A — triangular Mel filters over the frequency bins]

• Power spectrum from the complex spectrum:
 • Forward pass: y_t = x_t ∘ x_t*, where ∘ is elementwise multiplication and x_t* is the complex conjugate of x_t.
 • Backward pass: ∇x_t = 2 (∇y_t ∘ x_t*)*.


Feature Extraction Steps

• Mean normalization:
 • Forward pass: y_t = x_t − μ_x, where μ_x is the mean of x_t over time.
 • Backward pass: ∇x_t = ∇y_t − μ_{∇y}, where μ_{∇y} is the mean of ∇y_t.

• Delta features:
 • Forward pass: Δx_t = Σ_{τ=1}^{2} τ (x_{t+τ} − x_{t−τ}) / (2 Σ_{τ=1}^{2} τ²), where Δx_t and x_t are the delta and original features, respectively.
 • Implemented as an affine transform of the feature trajectories: Δx = A x, where x = [x_1, …, x_T]^T and A is a T×T sparse transform (see Xiao’16d).
 • The backward pass is straightforward, as it is a linear transform.
 • Both delta and acceleration features are used.

• Log compression:
 • Forward pass: y = log x.
 • Backward pass: ∇x = ∇y / x.
 • A small constant is added to prevent numerically unstable gradients due to 1/x.
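A minimal numpy sketch of two of the steps above (power spectrum and log) together with their backward passes; it only illustrates how gradients flow back to the complex STFT coefficients, not the SignalGraph implementation.

import numpy as np

def power_forward(x):
    return (x * x.conj()).real                 # y = x ∘ x*

def power_backward(grad_y, x):
    return 2 * (grad_y * x.conj()).conj()      # ∇x = 2 (∇y ∘ x*)*

def log_forward(x, eps=1e-10):
    return np.log(x + eps)                     # small constant avoids 1/x blow-up

def log_backward(grad_y, x, eps=1e-10):
    return grad_y / (x + eps)                  # ∇x = ∇y / x

# Toy chain over two complex bins: gradients flow back to the complex spectrum
x = np.array([1 + 2j, 0.5 - 1j])
p = power_forward(x)
grad_p = log_backward(np.ones(2), p)           # gradient arriving at the power layer
grad_x = power_backward(grad_p, x)             # gradient at the complex STFT input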


Feature Extraction Steps

• Context splicing:
 • Forward pass: y_t = [x_{t−L}^T, …, x_{t+L}^T]^T, where 2L+1 is the context size.
 • Backward pass: collect the gradient ∇x_t from the corresponding dimensions of ∇y_t of the neighboring frames.

• Feature normalization:
 • Implemented as a diagonal affine transform.
 • Parameters are obtained from a portion of the training data using the initial beamforming filter weights; these parameters are fixed during training.


Results of Modified LIMABEAM on CHiME4 (Xiao’16b)

• Modified LIMABEAM is implemented in the SignalGraph toolkit: https://github.com/singaxiong/SignalGraph
 • A Matlab-based deep learning toolkit for signal processing tasks.
 • Available online a few days after the tutorial.

• Modified LIMABEAM performs slightly worse (14.5%) than BeamformIt (13.6%).
 • Perhaps due to the local-optimum problem of the EM algorithm.
• When initialized with MVDR weights, modified LIMABEAM reduces the WER from 9.5% to 9.2%.

• Analysis
 • The filter weights are optimized for the GMM instead of the DNN acoustic model → suboptimal performance.
 • No first-pass decoding is used for generating the state alignment → less accurate Gaussian posteriors?

WER on the CHiME4 6-ch track, et05-real, with a DNN sMBR acoustic model (6-ch pooled training) and trigram LM:

Methods | Settings | WER
Single channel | Single distant microphone | 21.6%
Delay-and-sum (BeamformIt) | Weighted delay-and-sum with TDOA tracking and channel gain estimation | 13.6%
MVDR | MVDR using an LSTM-predicted time-frequency mask | 9.5%
Modified LIMABEAM | TDOAs initialized as 0, gains initialized as 1 | 14.5%
Modified LIMABEAM | Weights initialized by MVDR beamforming | 9.2%


Summary of Generative Approach

• Advantages of generative approaches:
 • Rely only on a clean acoustic model for filter estimation → robust to unseen noise types.
 • Do not need clean-noisy parallel speech for training.

• Limitations:
 • Filter estimation is time consuming.
 • Mismatched models in filter estimation (GMM) and decoding (DNN).

• Ways for improvement:
 • Better generative models for capturing the distribution of speech signals.
 • Better weight initialization (EM is sensitive to initialization).
 • Better and faster optimization methods.
 • Optimizing directly for the DNN used in the decoder.
 • Multi-condition environment adaptive training (similar to speaker adaptive training).


Summary of Generative Approach

[Spectrum from “more signal processing” to “more machine learning”: delay-and-sum, MVDR (use signal processing theory to estimate the filters) → maximum likelihood beamforming (uses the beamforming equation, but uses machine learning for filter estimation) → multi-channel acoustic model (does not even use the beamforming equation; black-box approach)]


3.2 Discriminative Approach – filter learning with phase information


Filter learning with phase information

• Both magnitude and phase information are available to the acoustic model.
 • Use time-domain raw waveforms as inputs,
 • or use frequency-domain magnitude+phase or real+imaginary parts as inputs.

• Let the network learn a bank of spatial filters that are optimized for ASR.
 • The learned spatial filters usually look at different directions of arrival (DOA) (Hoshen’15, Sainath’15).
 • The information from all directions is fed to the rest of the acoustic model.
 • Let the model decide how to use this information for the best ASR performance.

[Diagram: array inputs → spatial filters 1 … P → rest of acoustic model → speech class posteriors]


Time domain filter learning (Hoshen’15, Sainath’15)

• Filter-and-sum is approximated by a multi-channel temporal convolution:
 • Input: a J×M matrix of raw waveform samples, where J is the number of channels and M is the frame size in samples.
 • P filters w_1, …, w_P, each with a weight matrix of size J×N, where N < M is the filter length in samples; convolved along the time axis; P up to 256.
 • The output of each filter, y_p ∈ R^{M−N+1}, is max-pooled (with a nonlinearity).
 • The output of the convolutional layer, z ∈ R^{1×P}, is used for acoustic modeling.

• Equivalent to using a bank of P beamforming filters, each looking at an arrival angle and a frequency band.

[Architecture: multi-channel speech waveform of a frame, x ∈ R^{J×M} → multi-channel temporal convolution with w_1, …, w_P ∈ R^{J×N} → y_1, …, y_P ∈ R^{M−N+1} → pooling and nonlinearity → z ∈ R^{1×P} → acoustic model network (convolution + LSTM + DNN) → speech state posterior → ASR cross-entropy cost]
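A minimal numpy sketch of the unfactored multi-channel temporal convolution with max-pooling described above; the shapes follow the slide (J channels, M samples, P filters of length N), the weights are random stand-ins for learned ones, and the pooling/nonlinearity ordering is one plausible choice.

import numpy as np

def multichannel_tconv(x, w):
    """x: (J, M) waveform frame; w: (P, J, N) filters. Returns z: (P,) pooled outputs."""
    J, M = x.shape
    P, _, N = w.shape
    z = np.empty(P)
    for p in range(P):
        # Filter-and-sum: filter each channel with its own taps, then sum over channels
        y_p = sum(np.correlate(x[j], w[p, j], mode='valid') for j in range(J))  # (M-N+1,)
        # Max-pool over time, with a rectifier and log compression as the nonlinearity
        z[p] = np.log1p(np.maximum(y_p, 0.0).max())
    return z

rng = np.random.default_rng(0)
z = multichannel_tconv(rng.standard_normal((2, 400)),    # 2 channels, 25 ms at 16 kHz
                       rng.standard_normal((8, 2, 80)))  # 8 filters, 5 ms each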


Learned beamforming filters (2 ch) (Sainath’15)

[Figure: selected unfactored filters]

• The filters are 25 ms long; there are 256 of them in the network.
• The filters exhibit strong frequency selectivity and also spatial selectivity.


Time domain filter learning (Hoshen’15, Sainath’15)

• Advantages:
 • Global optimization of beamforming, feature learning, and acoustic modeling.

• Limitations:
 • Fixed looking directions. Good for the 2-channel case, as the beam is wide and few looking directions are enough; large arrays have narrower beams and may require a large number of filters.
 • Filters have both spatial and frequency selectivity → a large number of filters is needed to cover all direction/frequency combinations.
 • Dependency on a fixed array geometry → the network needs to be retrained if the geometry changes.
 • Instead of using filterbanks, features are learned from multi-channel data → the acoustic model cannot be shared with single-channel data → multiple AMs need to be maintained.


Factored time domain filter learning (Sainath’16)

• Simultaneous spatial and frequency selectivity → a large number of filters is needed to cover all spatial directions and frequency bands.
• Use factorized filters: a spatial filter layer (tconv1) followed by a temporal filter layer (tconv2).
 • tconv1 contains very few short multi-channel filters (P up to 10, N = 80 taps, or 5 ms) → focuses on spatial filtering.
 • The output of tconv1 is P waveforms y_1 to y_P, each beamformed with a predefined looking direction.
 • tconv2 has a large number of single-channel filters (F = 128) with long filter length (L = 400 taps, or 25 ms) → focuses on temporal filtering.
 • tconv2 is shared by all P versions of the waveforms from tconv1.

[Architecture: multi-channel speech waveform of a frame, x ∈ R^{J×M} → tconv1: multi-channel temporal convolution with w_1, …, w_P ∈ R^{J×N} → y_1, …, y_P ∈ R^M → tconv2: single-channel temporal convolution with w ∈ R^{L×F} → z ∈ R^{(M−L+1)×F×P} → pooling and nonlinearity → o ∈ R^{F×P} → acoustic model network (convolution + LSTM + DNN) → speech state posterior → ASR cross-entropy cost]


Learned beamforming filters (2 ch)

[Figures: selected unfactored filters vs. all 10 factored filters (tconv1)]

• Unfactored: the filters are 25 ms long, with 256 of them in the network; they exhibit strong frequency selectivity and also spatial selectivity.
• Factored (tconv1): the filters are 5 ms long, with only 10 of them in the network.
 • Less obvious frequency selectivity for some filters → better factorization.
 • However, all the filters still exhibit both kinds of selectivity → the factorization is not perfect.


ASR results (Hoshen’15, Sainath’15/16)

• For the 1-ch case, waveform input (i.e., automatic feature learning) outperforms filterbanks.
• 2-ch unfactored filter learning outperforms 8-ch delay-and-sum and MVDR beamformers. (MVDR performance depends heavily on statistics estimation.)
• Factored filters outperform unfactored ones significantly. Multi-task learning works well.
• Conclusion: filter learning significantly outperforms traditional beamforming methods.

Word error rates (%):

Methods | Cross entropy | Sequential
1-ch, filterbanks | 25.2 | 20.7
1-ch, waveform | 23.5 | 19.2
Delay-and-sum, 8 ch, waveform | 22.4 | 18.8
MVDR, 8 ch | 22.4 | 18.7
Unfactored filter learning (2 ch) | 21.8 | 18.2
Factored filter learning | 20.4 | 17.3
Factored filter learning + MSE training (MTL) | 20.0 | 17.0

Experimental settings:
• 2000 hours of simulated training data, 3 million sentences (voice queries)
• Reverberation simulation: 100 room configurations, T60 in [0.4 s, 0.9 s]
• Distance between source and array in [1 m, 4 m]
• Noise: SNR in [0 dB, 20 dB], average = 12 dB; noise samples from YouTube and “daily life” recordings
• 8-channel uniform linear array data simulated (2 cm between adjacent mics)
• 20 hours of simulated test data (no overlap in T60 and SNR settings); array geometry identical to the training data


Summary

• Learning fixed spatial filters
  • Each filter looks at one direction; together the filters cover all possible directions.
  • Each filter also does bandpass filtering, especially in unfactored filter learning.

• Uses little knowledge of classic beamforming
  • Only the multi-channel temporal convolution is motivated by filter-and-sum.

• Joint filter learning and feature learning → more flexible model → perhaps more potential gain.

• Drawbacks:
  • Does not make use of theoretical knowledge of signal processing.
  • Array geometry dependent → the network needs to be retrained for a new geometry.
  • Feature learning → every instance of the model uses different features → hard to share models between corpora.


Summary

More signal processing ←→ More deep learning

• Delay-and-sum, MVDR: use signal processing theory to estimate the filters.

• Maximum likelihood beamforming: uses the beamforming equation, but uses machine learning for filter estimation.

• Filter learning: does not even use the beamforming equation; a black-box approach.


3.3 Discriminative Approach - filter learning without phase information


Filter learning without phase information

• Concatenate speech features (e.g. log Mel filterbanks) of all channels at the input of the acoustic model.
  • With fully connected networks (Swietojanski'13, Liu'14)
  • With CNNs (Swietojanski'14a)

• Features are derived from power → phase information is ignored.

• Only makes use of power-difference information between channels → not enough information for spatial filtering.

(Marino’11, Swietojanski’13 , Liu’14, Swietojanski’14a)

[Figure: the per-channel features 𝐲_{t,1} and 𝐲_{t,2} are concatenated, 𝐨_t = [𝐲_{t,1}; 𝐲_{t,2}], and fed to the acoustic modeling network.]


CNN-based multi-channel input (feature domain)

• Each channel is treated as a different feature map at the input of a CNN acoustic model.

Conventional CNN:
• Process each channel with a different filter w_j and sum across channels:

  h_{t,f} = σ( Σ_j w_j ∗ y_{t,f,j} + b )

• Similar to beamforming, but the filter is shared across time-frequency bins and the input does not include phase information.

Channel-wise convolution:
• Process each channel with the same filter w:

  h_{t,f,j} = σ( w ∗ y_{t,f,j} + b )

• Max pooling across channels, max_j(h_{t,f,j}), selects the "most reliable" channel for each time-frequency bin.

(Swietojanski'14a)
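A minimal numpy sketch of the two variants (toy shapes; random filters stand in for trained ones, and convolving over the frequency axis is an assumption for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv_freq(feat, w):
    """1-D convolution of each frame's feature vector with filter w."""
    return np.stack([np.convolve(frame, w, mode="same") for frame in feat])

# Assumed toy shapes: J channels, T frames, F mel bins, K-tap filters.
J, T, F, K = 4, 100, 40, 5
y = np.random.randn(J, T, F)           # per-channel log-mel feature maps
w_per_ch = np.random.randn(J, K)       # conventional CNN: one filter per channel
w_shared = np.random.randn(K)          # channel-wise conv: one shared filter
b = 0.1

# Conventional CNN: filter each channel with its own filter, sum across channels.
h_conv = sigmoid(sum(conv_freq(y[j], w_per_ch[j]) for j in range(J)) + b)

# Channel-wise convolution: same filter on every channel, then max pooling
# across channels selects the "most reliable" channel per time-frequency bin.
h_cw = np.stack([conv_freq(y[j], w_shared) for j in range(J)])
h_pool = sigmoid(h_cw + b).max(axis=0)
```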


Results for AMI corpus

Method                                              | DNN    | CNN
Single distant mic                                  | 53.1 % | 51.3 %
Multi-channel input (4ch)                           | 51.2 % | 50.4 %
Multi-channel input (4ch), channel-wise convolution | -      | 49.4 %
BeamformIt (8ch)                                    | 49.5 % | 46.8 %

Back-end configuration: 1 CNN layer followed by 5 fully connected layers; input features: 40 log Mel filterbanks + delta + delta-delta.

• Multi-channel input improves over single-channel input.
• Beamforming seems to perform better, possibly because it exploits phase differences across channels.

Results from (Swietojanski'14a)


Summary

More signal processing ←→ More deep learning
(Delay-and-sum, MVDR — Maximum likelihood beamforming — Filter learning without phase information)

• Using multi-channel power features improves ASR performance.

• Phase is not used → no spatial filtering is performed.

• Black-box approach → not clear how the model makes use of multi-channel information.


3.4 Discriminative Approach - filter prediction by neural networks


Beamforming filter prediction

• Filter learning uses a bank of pre-defined spatial filters, each looking at a different spatial direction → the filters are not customized to the test sentence.

• Filter prediction estimates one spatial filter for a test sentence.

• STFT domain filter prediction (Xiao’16a/c, work started in JSALT 2015)

• Time domain filter prediction (Li’16)

X(t,f) = Σ_j W_j(f) Y_j(t,f)

[Figure: channel signals Y_j(t,f) are weighted by the predicted filters W_1(f), …, W_J(f) and summed; the beamforming filter predictor network outputs the weights, and the classic beamforming formula is used.]

(Xiao’16a/c, Li’16)
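A minimal sketch of how predicted frequency-domain weights are applied, following the formula above (shapes and the random stand-in for the predictor output are assumptions):

```python
import numpy as np

J, T, F = 8, 200, 257                  # channels, frames, frequency bins
# STFT of the array signals (random stand-in for a real utterance)
Y = np.random.randn(J, T, F) + 1j * np.random.randn(J, T, F)

# W would come from the filter predictor network after mean pooling;
# random values stand in here.
W = np.random.randn(J, F) + 1j * np.random.randn(J, F)

# X(t, f) = sum_j W_j(f) Y_j(t, f): one fixed filter applied to every frame
X = np.einsum("jf,jtf->tf", W, Y)      # (T, F) enhanced STFT
```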


Frequency Domain Filter Prediction

• Input to the network: phase-carrying features
  • Generalized cross correlation
  • Spatial covariance matrix

• Output of the network: filter weights
  • Real and imaginary parts are predicted separately.
  • 8 ch × 257 bins × 2 = 4112 real values.

• Mean pooling over a sentence for stable filter prediction.
  • Can be removed if the speaker is moving.

(Xiao’16a/c)

[Figure: frequency-domain filter prediction pipeline — multi-channel speech signals of an utterance → generalized cross correlation extraction (588-D GCC feature vectors per frame, reshaped to 21×28; examples shown for 4 DOA angles: 45°, 145°, 225°, 315°) → filter prediction network → real/imaginary beamforming weights of all frames (4112-D, frame rate = 10 Hz) → mean pooling → average beamforming weights (257-D real part and 257-D imaginary part per channel, 8 channels). Illustration generated from an 8-mic circular array with 20 cm diameter.]


Phase-carrying features

• Phase differences between channels tell us the spatial locations of the target and the interference.

• Generalized cross correlation (GCC, Knapp'76, Xiao'16a, see section 2.2.1)
  • 28 microphone pairs from 8 microphones.
  • 21 features per microphone pair for 16 kHz sampling and 20 cm maximum microphone distance.
  • Total features per frame: 21 × 28 = 588.

• Normalized spatial covariance matrix (SCM) (Xiao'16c, see section 2.2.2)
  • 28 complex values (upper triangle) from the SCM of every frequency bin.
  • The dimensionality needs to be reduced by averaging neighboring bins.

(Xiao’16a/c)

[Figure, GCC features: GCC feature vectors (reshaped to 21×28) for 4 DOA angles (45°, 145°, 225°, 315°); different DOA angles show different patterns → good for predicting DOA and filter weights.]

[Figure, SCM features: 8×8 SCMs of two time-frequency bins; power normalization reduces variability due to speech content → makes phase information more obvious; the upper triangle is used as features.]

GCC-PHAT: ψ^PHAT_{y_i y_j}(τ) = IFFT[ Y_i(f) Y_j^*(f) / |Y_i(f) Y_j^*(f)| ]

SCM: 𝐑_f = (1/T) Σ_t 𝐲_{t,f} 𝐲_{t,f}^H
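A minimal sketch of GCC-PHAT feature extraction for one microphone pair, following the equation above (the frame length, FFT size, and ±10-sample lag range are assumptions consistent with 16 kHz sampling and ~20 cm spacing):

```python
import numpy as np

def gcc_phat(yi, yj, n_fft=512, max_lag=10):
    """GCC-PHAT features for one frame of a microphone pair (i, j)."""
    Yi = np.fft.rfft(yi, n_fft)
    Yj = np.fft.rfft(yj, n_fft)
    cross = Yi * np.conj(Yj)
    cross /= np.abs(cross) + 1e-10       # PHAT weighting removes magnitude
    psi = np.fft.irfft(cross, n_fft)     # correlation as a function of lag
    # keep lags -max_lag..+max_lag around zero delay -> 21 features
    return np.concatenate([psi[-max_lag:], psi[:max_lag + 1]])

# Toy usage: random frames stand in for a real microphone pair.
frame_i, frame_j = np.random.randn(400), np.random.randn(400)
feat = gcc_phat(frame_i, frame_j)        # 21-D feature for this pair
```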


System Diagram

• The cost function is the ASR cross entropy.

• The acoustic model and the beamforming filter prediction network are jointly optimized.

• Beamforming is performed in the frequency domain.

[Figure: multi-channel speech signals → Short-Time Fourier Transform (STFT) → multi-channel Fourier coefficients → feature extraction → filter prediction network → complex-valued beamforming weights → beamforming in frequency domain → single-channel enhanced Fourier coefficients → log Mel filterbank features → acoustic model DNN → speech state posterior → ASR cross-entropy cost.]

(Xiao’16a/c)


Training Procedures

The global network is difficult to train due to its deep structure → use existing beamforming methods to pre-train the filter prediction network.

(Xiao’16a/c)

[Figure: 1. Pre-training — multi-channel speech signals of an utterance → filter prediction network, trained with a mean square error cost against target beamforming weights, e.g. delay-and-sum weights. 2. Fine-tuning — the full pipeline (STFT → feature extraction → filter prediction network → beamforming in frequency domain → log Mel filterbank features → acoustic model DNN → speech state posterior) is optimized with the ASR cross-entropy cost.]


Experimental settings on AMI meeting corpus

• 8-channel table-top circular array, 20 cm diameter.

• Filter prediction network: 3×1024 DNN.
  • Pre-trained on 100 hours of 8-ch simulated data (WSJCAM0 + reverb + noise).
  • Fine-tuned on 75 hours of AMI data.
  • Simulated data and real data share the same array geometry.

• Acoustic model: 6×2048 DNN.
  • Trained on 75 hours of enhanced AMI data.

• 10 hours of eval data on AMI.

[Figure: a circular array with 20 cm diameter and 8 microphones at 45° intervals (0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°).]


Predicted filter's beam patterns (Xiao'16a)

[Figure: beam patterns of delay-and-sum with the true DOA, the DNN pre-trained by delay-and-sum, the DNN refined by the speech enhancement cost, and the DNN refined by the ASR cross-entropy cost; the true DOA is denoted by a red square.]

• Beam patterns for a simulated sentence, DOA = 166 degrees.

• 8-channel circular microphone array, 20 cm diameter.

• The pre-trained network predicts a filter similar to delay-and-sum → a good student learns well from its teacher.

• The DOA is estimated correctly at all frequencies.

• The CE-refined network predicts somewhat different beam patterns from the pre-trained network.


Predicted filter’s beam patterns (Xiao’16a)

• Beam patterns for a real sentence (from the AMI eval set); no ground-truth DOA.

• All predicted filters point to a similar direction, about 110 degrees.

• Moderately different beam patterns between the different stages of training.

[Figure: beam patterns of the DNN pre-trained by delay-and-sum, the DNN refined by the speech enhancement cost, and the DNN refined by the ASR cross-entropy cost.]


Enhanced speech (Xiao’16a)

• Spectrograms from the AMI real eval set.

• The direct-path signal is enhanced more than the reflected signals → improved direct-to-reverberant energy ratio.
  • See the frames near 140 for reduced reverberation.

[Figure: spectrograms (0–8000 Hz vs. frame index) of the close-talk mic (IHM), the 1st far-talk mic (SDM1), and the signals beamformed by the pre-trained, enhancement-refined, and ASR cross-entropy-refined networks.]


Results on the AMI corpus

Word Error Rates (%) on the AMI eval set (trigram LM); GCC and SCM are the phase-carrying input features:

Methods                                                                        | GCC    | SCM
1 ch (close-talk)                                                              | 25.5 % |
1 ch (far-talk)                                                                | 53.8 % |
Delay-and-sum, 8ch (BeamformIt)                                                | 47.9 % |
Filter predictor network, 8ch, pre-trained by delay-and-sum and simulated data | 47.2 % | 45.9 %
Filter predictor network, 8ch, refined by ASR cost on real data                | 44.7 % | 44.1 %

• Spontaneous speech + non-native speakers → high WER even for the close-talk microphone.
• The pre-trained system is trained on simulated data and tested on real data.
• The refined system is fine-tuned with the ASR cost function.

(Xiao'16a/c)


Time Domain Filter Prediction

• Li'16 applied the filter prediction approach in the time domain with a more advanced acoustic model setting.

• Filter prediction: LSTMs predict FIR filter weights for a two-channel array (17-161 taps for a 16 kHz sampling rate).
  • The first LSTM layer is shared by the 2 channels; the second LSTM layer is split.
  • Online prediction, no mean pooling.

• Gated feedback: inputs to the LSTM include the signal samples and feedback from the last hidden layer of the acoustic model.

• Multi-task learning: clean speech is also reconstructed from an early stage of the acoustic model.

• No pre-training; the ASR cost function is used all the way (2000 hours of training data).

(Li’16)


Time Domain - Results

• Data and settings are the same as in the filter learning experiments.

Observations
• Multi-task learning improves performance significantly (1.0% absolute).
• Gated feedback improves performance moderately (0.2% absolute).
• Frequency-domain filter prediction is 0.5% worse than time-domain filter prediction, perhaps because there are too many parameters to predict in the frequency domain and mean pooling is not used.

Comparison with filter learning
• Similar performance between filter learning and filter prediction.

(Li’16)

Word Error Rate (%):

Methods                    | Cross entropy | Sequential
Single channel             | 23.5          | 19.2
Delay-and-sum              | 22.4          | 18.8
Unfactored filter learning | 21.7          | 17.5
Factored filter learning   | 20.4          | 17.1
Filter prediction          | 20.5          | 17.2


Summary

• The network is able to learn the nonlinear mapping from input signals to filter weights (a highly nonlinear function):
  • waveform to time-domain weights (Li'16);
  • GCC or SCM features to frequency-domain weights (Xiao'16a/c).

• Limitations:
  • Array geometry dependent → the network needs retraining for a new geometry.

• A training recipe will be added to SignalGraph a few days after the tutorial: https://github.com/singaxiong/SignalGraph

More signal processing ←→ More deep learning
(Delay-and-sum, MVDR, etc. — Filter prediction — Filter learning)

More signal processing is used in filter prediction than in filter learning.


3.5 Discriminative Approach - mask prediction for filter estimation


Mask estimation by neural networks

• Time frequency (TF) mask defines whether a TF bin in the spectrogram is dominated by speech or noise.

• The mask can be predicted by a neural network, and then used for speech enhancement.

• The mask is used as a filter and multiplied elementwise with the noisy log power spectrum.

• In this section, we review the use of estimated mask for beamforming filter estimation.

• Before that, let’s first review the use of mask estimation in single-channel speech enhancement.

[Figure: mask estimation for single-channel speech enhancement — single-channel speech waveform → extract log power spectrum → mask predictor network → mask 𝐦_t → elementwise multiplication → enhanced spectrogram.]
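A minimal sketch of the masking operation (shapes are assumptions, and the hypothetical mask_net stands in for a trained mask predictor):

```python
import numpy as np

T, F = 200, 257
noisy_logpow = np.random.randn(T, F)    # log power spectrum of noisy speech

def mask_net(x):
    """Placeholder for a trained mask predictor; outputs values in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-x))

m = mask_net(noisy_logpow)              # TF mask, one value per TF bin
enhanced = m * noisy_logpow             # elementwise multiplication
```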


Single-channel joint optimization

• The mask estimator can be pre-trained using the ideal binary/ratio mask as the target.

• The estimator can then be fine-tuned for ASR using an ASR cost function.

(Wang’16)

[Figure: single-channel speech waveform → extract log power spectrum → DNN-based mask estimator → mask 𝐦_t → elementwise multiplication → enhanced log power spectrum → feature extraction → features → acoustic model → speech class posterior probabilities → ASR-based cost function against the true labels.]


Experiments on CHiME 2

System                                               | CE     | sMBR
Baseline (no SE front-end)                           | 16.2 % | 13.9 %
Mask estimation using CE                             | 14.8 % | 13.4 %
Mask estimation + retraining                         | 15.5 % | 13.9 %
Joint training of mask estimation and acoustic model | 14.0 % | 12.1 %
Large DNN-based acoustic model                       | 15.2 % | -

Results from (Wang'16)

Enhancement DNN: predicts the mask (CE objective function); features: log power spectrum.
Acoustic model DNN: log Mel filterbanks; trained on noisy speech with the cross entropy (CE) or sMBR objective function.


Mask estimation for beamforming

• Use the estimated mask to derive the filter parameters.

• Compared to the filter prediction network, the mask prediction approach:
  • uses more signal processing prior knowledge;
  • achieves array geometry independence more easily.

[Figure: the filter prediction network maps the array signals directly to filter weights; the mask prediction network maps the array signals to a TF mask, from which a classic beamforming estimation equation computes the filter weights → a more specific role for the neural net.]


MVDR filter estimation using mask

• The estimated masks can also be used to compute the spatial covariance matrices (SCMs) of speech and noise (similar equations are used in section 2.2.2):

  𝐑_f^speech = Σ_t M_S(t,f) 𝐲_{t,f} 𝐲_{t,f}^H / Σ_t M_S(t,f)

  𝐑_f^noise = Σ_t M_N(t,f) 𝐲_{t,f} 𝐲_{t,f}^H / Σ_t M_N(t,f)

• M_S and M_N are the speech and noise masks, respectively.

• 𝐲_{t,f} is the N × 1 vector of observed Fourier transform coefficients at time t and frequency f, where N is the number of channels.

• Using a single mask value for all channels may not be optimal; this may be improved in the future.

• Beamforming filters can be derived from the SCMs, e.g.

  𝐰_f^MVDR = (𝐑_f^noise)^{-1} 𝐑_f^speech 𝐮 / trace[(𝐑_f^noise)^{-1} 𝐑_f^speech]

• 𝐮 is a column vector that specifies the reference channel. The first channel is used as the reference here; it is also possible to select the reference based on SNR or TDOA information.

• Other beamformers can also be used, such as generalized eigenvalue (GEV, Heymann'15/16).

[Figure: noisy log power spectrum → neural-network-based mask predictor → estimated mask → used to derive the beamforming filters.]

(Heymann’15, Hori’15, Heymann’16, Xiao’16b/17)
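A minimal numpy sketch of mask-based MVDR, implementing the SCM and filter equations above (shapes are assumptions, and random masks stand in for network outputs):

```python
import numpy as np

N, T, F = 6, 200, 257                   # channels, frames, frequency bins
y = np.random.randn(N, T, F) + 1j * np.random.randn(N, T, F)  # observed STFT
m_s = np.random.rand(T, F)              # speech mask (stand-in for a network)
m_n = 1.0 - m_s                         # noise mask

u = np.zeros(N); u[0] = 1.0             # reference channel selector (1st channel)
w = np.empty((F, N), dtype=complex)
for f in range(F):
    yf = y[:, :, f]                                           # (N, T)
    # mask-weighted spatial covariance matrices of speech and noise
    R_s = (m_s[:, f] * yf) @ yf.conj().T / m_s[:, f].sum()
    R_n = (m_n[:, f] * yf) @ yf.conj().T / m_n[:, f].sum()
    num = np.linalg.solve(R_n, R_s)                           # R_n^{-1} R_s
    w[f] = (num / np.trace(num)) @ u                          # MVDR weights

X = np.einsum("fn,ntf->tf", w.conj(), y)   # X(t,f) = w_f^H y_{t,f}
```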


LSTM-based mask estimation for beamforming (Xiao'16b/17)

• Combines the advantages of neural networks (for mask prediction) and traditional signal processing (portability).

• Masks are derived from 1-ch signals:
  • Con: does not exploit spatial information for mask estimation.
  • Pro: array geometry independent.

[Figure: apply STFT on channels → extract log power spectrum → LSTM mask estimator (processes channels separately) → speech mask and noise mask → pooling → estimate spatial covariance matrices 𝐑_f^speech and 𝐑_f^noise → obtain MVDR beamforming weights, 𝐰_f^MVDR = (𝐑_f^noise)^{-1} 𝐑_f^speech 𝐮 / trace[(𝐑_f^noise)^{-1} 𝐑_f^speech] → beamforming in frequency domain, X(t,f) = 𝐰_f^H 𝐲_{t,f}.]


Predicting speech and noise masks separately

• We can obtain the noise mask from the speech mask: m_{t,f}^n = 1 − m_{t,f}^s.

• We can also estimate the noise mask independently → use different projection transforms for the speech and noise masks.

[Figure: input features 𝒙_t, Δ𝒙_t, ΔΔ𝒙_t → LSTM → hidden state 𝒉_t → two separate affine transforms with sigmoids → speech mask 𝒎_t^s and noise mask 𝒎_t^n.]

(Xiao’16b/17)
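A minimal sketch of the split heads (shapes assumed; random weights stand in for the trained LSTM output and affine transforms):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

T, H, F = 200, 1024, 257
h = np.random.randn(T, H)              # LSTM hidden states h_t of one utterance

# two separate affine transforms + sigmoid -> independent speech/noise masks
W_s, b_s = np.random.randn(H, F) * 0.01, np.zeros(F)
W_n, b_n = np.random.randn(H, F) * 0.01, np.zeros(F)
m_s = sigmoid(h @ W_s + b_s)           # speech mask m_t^s
m_n = sigmoid(h @ W_n + b_n)           # noise mask m_t^n (not forced to 1 - m_s)
```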


Refining LSTM Mask Predictor by ASR Cost (Xiao'16b/17)

The mask predictor directly minimizes the ASR cost function.

[Figure: multi-channel speech signals of an utterance → apply STFT on channels → extract log power spectrum → LSTM mask estimator (processes channels separately) → speech mask and noise mask → pooling → estimate spatial covariance matrices → obtain MVDR beamforming weights (filter weights) → beamforming in frequency domain → enhanced STFT coefficients → feature extraction → acoustic model → speech class posterior probabilities → cross-entropy cost against the true labels (ASR-based cost function).]


Training Procedures

Use the ideal binary mask to pre-train the mask estimator on simulated data.

[Figure: 1. Pre-training — single-channel speech signals of an utterance → mask prediction network, trained with a mean square error cost against the ideal binary mask of simulated data. 2. Fine-tuning — multi-channel speech signals → Short-Time Fourier Transform (STFT) → multi-channel Fourier coefficients → mask prediction network → MVDR filter estimation equation → complex-valued beamforming weights → beamforming in frequency domain → single-channel enhanced Fourier coefficients → acoustic model DNN → speech state posterior → ASR cross-entropy cost.]

(Xiao’16b,17)
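A minimal sketch of how the ideal-binary-mask pre-training target can be computed from parallel clean/noise STFTs of simulated data (the shapes and the 0 dB dominance threshold are assumptions):

```python
import numpy as np

T, F = 200, 257
S = np.random.randn(T, F) + 1j * np.random.randn(T, F)   # clean speech STFT
Nz = np.random.randn(T, F) + 1j * np.random.randn(T, F)  # noise STFT

# a TF bin counts as "speech" when speech power exceeds noise power (0 dB SNR)
ibm_speech = (np.abs(S) ** 2 > np.abs(Nz) ** 2).astype(float)
ibm_noise = 1.0 - ibm_speech
# the mask network is then pre-trained with an MSE cost against ibm_speech
```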


Multi-pass mask estimation at run-time

• Mask estimation and beamforming can be performed iteratively, in alternating order.

• Both the mask and the signal can be improved until convergence.

• 2-3 passes are enough (see the sketch after the figure below).

[Figure: unfolded network of multi-pass mask estimation — multi-channel speech signals of an utterance → apply STFT on channels → extract log power spectrum → LSTM mask estimator + pooling (channels processed separately) → speech and noise masks → estimate spatial covariance matrix → obtain MVDR beamforming weights → beamforming; the log power spectrum of the enhanced signal is fed back into the LSTM mask estimator, and the mask estimation / beamforming stages repeat until the final enhanced speech. In short: noisy signal → mask estimation → beamforming → enhanced signal → mask estimation → … ]
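A minimal sketch of the multi-pass loop (the helper names estimate_masks, mvdr_from_masks, and beamform are hypothetical stand-ins for the mask estimator, SCM/MVDR estimation, and beamforming components described above):

```python
import numpy as np

def multi_pass_enhance(y_stft, estimate_masks, mvdr_from_masks, beamform,
                       n_pass=3):
    """y_stft: (channels, frames, bins) STFT of the noisy array signals."""
    signal = y_stft                               # pass 1 sees the noisy channels
    for _ in range(n_pass):                       # 2-3 passes are enough
        m_s, m_n = estimate_masks(signal)         # LSTM masks (+ pooling)
        w = mvdr_from_masks(y_stft, m_s, m_n)     # SCMs -> MVDR weights
        enhanced = beamform(y_stft, w)            # (frames, bins)
        signal = enhanced[np.newaxis]             # feed the enhanced signal back
    return enhanced
```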


Experimental settings on CHiME-4 corpus

• 6-channel rectangular array mounted on a notepad computer (See section 1.1).

• Channel 2 does not face the speaker → low SNR.

• Near-field scenario → different gains between channels.

• Mask prediction network: 1-layer LSTM with 1024 memory cells.
  • Pre-trained on 100 hours of 1-ch simulated data (WSJ0 + reverb + noise).
  • Fine-tuned on 18 hours of CHiME-4 training data using the cross-entropy cost.

• Acoustic model: 7×2048 DNN.
  • Trained on 15 hours of simulated + 3 hours of real data (CHiME-4 training data).
  • Sequentially trained.

• 1 hour of eval data.

[Figure: 6 microphones mounted on a notepad computer.] Real data are recorded in different environments: bus, cafeteria, pedestrian, and street.


Estimated masks

[Figure: masks estimated by (1) an LSTM pre-trained to predict the speech ideal binary mask, (2) an LSTM refined to reduce the ASR cross-entropy cost, and (3) an LSTM refined with the ASR cost where speech and noise masks are estimated separately.]

Observations
• Noise masks generated by the LSTM trained with the ASR cost are sparser than those generated by the IBM-approximating LSTM.
• Split masks (separately estimated noise and speech masks) are sparser still.
• Possible explanation: it is good to use only confident TF bins for estimating the spatial covariance matrices.


Experiments on CHiME-4 (6 channel)Xiao’16b

Settings WER (%)

#ch for mask prediction ASR cost Split Mask Pooling #PassEval Real

data

1-channel track 21.6

Weighted delay-and-sum (BeamformIt, TDOA tracking + gain estimation) 13.6

Traditional MVDR (TDOA tracking + gain and noise estimation) 12.0

Frequency domain filter prediction (MSE training) 12.9

Use only the first channel

No

NoNo

1 12.8

Yes

1 10.9

Yes

1 10.1

Estimate masks for all 6 channels, then pool the masks

median

1 9.5

3 8.9

Effect of ASR cost function

Gain of separate noise and speech masks

Gain of mask pooling

Gain of multi-pass processing

Outperform delay-and-sum slightly.

Observation• Mask prediction + MVDR

formula produces good gain. • Filter prediction not working

well due to large gain difference between channels.


Summary of Mask Estimation Network

• Advantages
  • Single-channel mask estimation → array geometry independent.
  • MVDR beamforming equation → optimal noise reduction with no distortion of the speech, if the speech and noise statistics are accurately estimated.
  • Combines the strengths of neural nets and traditional multi-channel beamforming.

• Limitations:
  • Inter-channel phase information is not used for mask estimation → sub-optimal performance.
  • More limited use of neural networks → less potential for improvement?

• A joint training recipe (Matlab) is available in SignalGraph: https://github.com/singaxiong/SignalGraph/tree/master/examples/beamforming/mask_prediction
  • More comments, documentation, and bug fixes will be added.


Summary of Integrated Approach

• Neural networks are good mappers
  • Input to mask, input to filter weights, input to features, etc.

• Main challenge: how to combine human knowledge (signal processing) and deep learning to achieve:
  • Optimal ASR performance,
  • Interpretability,
  • Generalization to unseen room acoustics and noises,
  • Array geometry independence.

More signal processing ←→ More deep learning
(Delay-and-sum, MVDR, etc. — Mask prediction + MVDR/GEV — Filter prediction, Maximum Likelihood — Filter learning with/without phase)


Table of Beamforming Methods

Method-by-method comparison (in the original slide, red-colored properties are considered disadvantageous, green ones advantageous):

• Delay-and-sum: no generative model; no fixed looking directions; uses phase info; geometry independent; good for unseen rooms and noise; no training from data (no joint SE+AM training); does not use noise information. (Van Veen'88, Anguera'07)
• MVDR/GEV/max-SNR: as delay-and-sum, but uses noise information. (Capon'69, Van Veen'88)
• Maximum likelihood: generative model; joint training optimized for GMMs; implicitly uses noise through the noisy inputs. (Seltzer'03/04, Xiao'16b)
• Filter learning without phase: no generative model; no looking directions; does not use phase info; probably not geometry independent; robustness to unseen rooms and noise depends on how well the model generalizes; SE directly optimized for the DNN AM (joint training). (Marino'11, Swietojanski'13, Liu'14, Swietojanski'14a)
• Filter learning with phase: as above, but uses phase info and fixed looking directions. (Hoshen'15, Sainath'15/16)
• Filter prediction: no generative model; no fixed looking directions. (Time domain: Li'16; frequency domain: Xiao'16a/16c)
• Mask prediction: explicitly uses a noise estimate. (Non-joint training: Heymann'15/16, Erdogan'16; joint CE training: Xiao'16b/17)

• Implementations of MLBF, filter prediction, and mask prediction approaches are available in SignalGraph, a Matlab-based toolkit for applying deep learning technologies to signal processing: https://github.com/singaxiong/SignalGraph


References

(Anguera’07) Anguera, X. et al. “Acoustic beamforming for speaker diarization of meetings,” IEEE Trans. ASLP (2007).

(Capon’69) Capon J. "High-resolution frequency-wavenumber spectrum analysis." Proceedings of the IEEE (1969).

(Doclo’02) Doclo, S., et al. “GSVD-based optimal filtering for single and multi-microphone speech enhancement,”. IEEE Trans. SP (2002).

(Erdogan’15) Erdogan, H., et al. “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” Proc. ICASSP (2015).

(Heymann’15) Heymann, J. et al. “BLSTM supported GEV Beamformer front-end for the 3rd CHiME challenge,” Proc. ASRU (2015).

(Heymann’16) Heymann, J. et al. “Neural network based spectral mask estimation for acoustic beamforming,” Proc. ICASSP (2016).

(Higuchi’16) Higuchi, T. et al. “Robust MVDR beamforming using time frequency masks for online/offline ASR in noise,” Proc. ICASSP (2016).

(Hori’15) Hori, T., et al. “The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition,” Proc. of ASRU (2015).

(Hoshen’15) Hoshen, Y. et al. “Speech acoustic modeling from raw multichannel waveforms,” Proc. ICASSP (2015).

(Knapp’76) Knapp, C.H. et al. “The generalized correlation method for estimation of time delay,” IEEE Trans. ASSP (1976).

(Li’16) Li, B. et al. “Neural Network Adaptive Beamforming for Robust Multichannel Speech Recognition,” Proc. INTERSPEECH (2016)

(Liu’14) Liu, Y., et al. “Using neural network front-ends on far field multiple microphones based speech recognition,” Proc. ICASSP (2014).

(Marino’11) Marino, D., et al. “An analysis of automatic speech recognition with multiple microphones,” Proc. Interspeech (2011).

(Narayanan’13) Narayanan, A. et al. “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” Proc. ICASSP (2013).


References

(Sainath’15) Sainath, T. et al. “Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms”, Proc. ASRU (2015).

(Sainath’16) Sainath, T. et al. “ Factored spatial and spectral multichannel raw waveform CLDNNs”, Proc. Interspeech (2016).

(Seltzer’03) Seltzer, M. “Microphone Array Processing for Robust Speech Recognition,” PhD thesis (2003).

(Seltzer’04) Seltzer, M., et al. “Likelihood-maximizing beamforming for robust hands-free speech recognition,” IEEE Trans. ASP (2004).

(Swietojanski’13) Swietojanski, P., et al. “Hybrid acoustic models for distant and multichannel large vocabulary speech recognition,” Proc. ASRU (2013).

(Swietojanski’14a) Swietojanski, P., et al. “Convolutional neural networks for distant speech recognition,” IEEE Sig. Proc. Letters (2014).

(Swietojanski’14b) Swietojanski, P., et al. “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models, “ Proc. SLT (2014).

(Van Veen’88) Van Veen, B.D. et al. “Beamforming: A versatile approach to spatial filtering,” IEEE ASSP (1988).

(Wang’16) Wang, Z.-Q. et al., “A Joint Training Framework for Robust automatic speech recognition,” IEEE/ACM Trans. ASLP (2016).

(Weninger’15) Weninger, F., et al, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in Proc. Latent Variable Analysis and Signal Separation (2015).

(Xiao’15) Xiao, X. et al. “A learning-based approach to direction of arrival estimation in noisy and reverberant environments,” Proc. ICASSP (2015).

(Xiao’16a) Xiao, X. et al. “Deep beamforming networks for multi-channel speech recognition,” Proc. ICASSP (2016).

(Xiao’16b) Xiao, X. et al. “A Study of Learning Based Beamforming Methods for Speech Recognition,” Proc. CHiME-4 (2016).

(Xiao’16c) Xiao, X. et al. “Beamforming Networks Using Spatial Covariance Features for Far-field Speech Recognition,” Proc. APSIPA (2016).

(Xiao’16d) Xiao, X. et al. “Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation”, EURASIP Journal on Advances in Signal Processing (2016).

(Xiao’17) Xiao, X. et al. “On the time-frequency mask estimation for MVDR beamforming with application in robust speech recognition”, submitted to ICASSP (2017).

(Xu’15) Xu, Y. et al. “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Trans. ASLP (2015).


Recent outcomes


Recent activities on multi-microphone speech recognition

[Figure: timeline (2005–2015) of recent activities on multi-microphone speech recognition, including the ICSI meeting corpus and RATS.]

• CHiME-4 challenge: April to August 2016. We will introduce the latest activities.


CHiME-4 challenge

• CHiME-4
  • revisits the datasets originally recorded for CHiME-3;
  • increases the level of difficulty by constraining the number of mics available for testing.

• Three tracks:
  • 6ch (equivalent to the CHiME-3 challenge)
  • 2ch
  • 1ch

• Based on the outcomes of the CHiME-3 challenge (the CHiME-3 2nd-best system [Hori'15]), a state-of-the-art baseline recipe was created.
  • It is in the official Kaldi repository: https://github.com/kaldi-asr/kaldi/tree/master/egs/chime4
  • Everyone can start from this baseline.


Baseline enhancement

• BeamformIt

• https://github.com/xanguera/BeamformIt

• Developed for NIST RT project by X. Anguera (ICSI)

• Weighted delay-and-sum beamformer based on GCC

[Figure: spectrograms of the original 5th-channel signal and the BeamformIt output.]


Baseline features, AM, and LM (Section 1)

• Features
  • So-called fMLLR features
  • Needs a GMM construction step

• Acoustic model (nnet1 in Kaldi)
  • Noisy-data training using the 5th channel (not enhanced-data training)
  • 7-layer, 2048-neuron DNN with sequence-discriminative training (sMBR)

• Language model
  • 3-gram LM (provided by WSJ0)
  • 5-gram LM (SRILM) and RNNLM rescoring (Mikolov's RNNLM)

[Figure: feature pipeline — MFCC → context expansion → LDA → STC → fMLLR → context expansion → DNN input 𝐨_t.]


Baseline results (6ch track)

[Figure: WER (%) of the CHiME-3 baseline vs. the CHiME-4 baseline on Dev Real, Dev Simu, Test Real, and Test Simu.]

Significant error reduction → Everyone can start from this baseline


Baseline results

[Figure: baseline WER (%) for the 1ch, 2ch, and 6ch tracks (DNN+RNNLM) on Dev Real, Dev Simu, Test Real, and Test Simu.]


6ch WER (Test Real)

• Best system: 2.24 % (cf. CHiME-3 best: 5.83 %).

• 8 of the 15 systems outperform the CHiME-3 best system.

[Figure: CHiME-4 6ch-track results — submitted systems range from 2.24 % (best) to 11.51 % WER; the CHiME-3 best was 5.83 %.]


2ch WER

• Best system: 3.91 % (6ch-track best: 2.24 %).

• Two systems outperform the CHiME-3 6-channel best performance.

[Figure: CHiME-4 2ch-track results — submitted systems range from 3.91 % (best) to 16.58 % WER.]


1ch WER

[Figure: CHiME-4 1ch-track results — best systems at 9.15 %, 9.34 %, and 9.29 % WER; worst at 23.7 %.]

• Best systems: ~9.2 % (6ch-track best: 2.24 %, 2ch-track best: 3.91 %).

• There is still a large gap between single- and multi-channel systems.


Successful front-ends

• Mask-based beamforming (Section 2.2.2 or Section 3.5) to obtain robust spatial correlations
  • Spatial clustering
  • DNN/LSTM masking
  • The top 3 systems (and many other systems) use this technique

• Variants of beamformer (Section 2.2.2)
  • MVDR beamformer
  • Max-SNR beamformer
  • Generalized Sidelobe Canceller (GSC)

• System combination of various beamformer outputs


Successful back-ends

• Acoustic modeling (Section 1.3)
  • 6-channel training
  • Use of enhanced data (mixed into the 6-channel training data or used for adaptation)
  • Advanced acoustic models (but limited): CNN, BLSTM
  • Model adaptation (simple retraining of model parameters)

• Language modeling (Section 1.2)
  • RNNLM → LSTM

• New trend
  • Joint training (Section 3)


Problem solved?

• Multi-channel, single device, limited vocabulary (tablet and smartphone scenarios) → Yes

• Single-channel scenario → No

• Multiple devices, open vocabulary, very distant speech recognition → ????


Conclusion

• Combining SE and ASR techniques greatly improves performance in severe conditions
  • ASR back-end technologies: feature extraction/transformation; RNN/LSTM/TDNN/CNN-based acoustic modeling; model adaptation, …
  • SE front-end technologies: microphone arrays; neural network-based speech enhancement, …

• The introduction of deep learning had a great impact on multi-microphone speech recognition
  • Large performance improvements
  • A reshuffling of the importance of technologies
  • Joint training can further change the paradigm

• Sufficient performance is being reached under reasonable conditions

• There remain many challenges and opportunities for further improvement


Toward joint optimization?

• Joint training is a recent active research topic.
  • Currently it integrates the front-end and the acoustic model.
  • Combined with end-to-end approaches, it could introduce higher-level cues (e.g. linguistic info) into the SE front-end.

Separate optimization (SE front-end → ASR back-end):
• Both components are designed with different objective functions.
• Potentially, the SE front-end can be made more robust to unseen acoustic conditions (noise types, different mic configurations).
• Not optimal for ASR.

Joint optimization (SE front-end + ASR back-end):
• Both components are optimized with the same objective function.
• Potentially more sensitive to mismatch between training and testing acoustic conditions.
• Optimal for ASR.


Dealing with uncertainties

• Advanced GMM-based systems exploited the uncertainty of the SE front-end during decoding (uncertainty decoding).
  • This provided a way to interconnect a speech enhancement front-end and an ASR back-end optimized with different criteria.

• Exploiting uncertainty within DNN-based ASR systems has not been sufficiently explored yet.
  • Joint training is one option.
  • Gaussian neural networks?
  • Are there others?


More severe constraints

• Limited number of microphones
  • The best performances are obtained when exploiting multiple microphones.
  • A great gap remains compared with single-microphone performance.
  → Developing more powerful single-channel approaches remains an important research topic.

• Many systems assume batch processing or utterance-batch processing.
  → Further research is needed on online & real-time processing.

[Figure: REVERB challenge WERs — Headset: 5.9 %, Lapel: 8.3 %, 8ch: 9.0 %, 2ch: 12.7 %, 1ch: 17.4 %.]


More diverse acoustic conditions

• More challenging situations are waiting to be tackled
  • Dynamic conditions: multiple speakers; moving speakers, …
  • Various conditions: variety of microphone types/numbers/configurations; variety of acoustic conditions, rooms, noise types, SNRs, …
  • More realistic conditions: spontaneous speech; unsegmented data; microphone failures; asynchronization between devices, …
  • New directions: distributed mic arrays, …

• New technologies may be needed to tackle these issues.

• New corpora are needed to evaluate these technologies.


Larger multi-microphone speech corpora

• Some industrial players have access to large amounts of field data…
  … while most publicly available DSR corpora are relatively small scale.

• This has some advantages:
  • Lower barrier of entry to the field
  • Faster experimental turnaround
  • New applications start with a limited amount of available data

• But…
  • Are the developed technologies still relevant when the training data cover a large variety of conditions?
  • Could the absence of large corpora hinder the development of data-demanding new technologies?

→ There is a need to create larger, publicly available multi-microphone corpora.


DSR data simulation

• Low-cost way to obtain a large amount of data covering many conditions.

• The only solution to obtain noisy/clean parallel corpora.

• Distant microphone signals can be simulated as

  microphone signal = clean speech ∗ measured room impulse response + measured noise

• Good simulation requires measuring the room impulse responses and the noise signals in the same rooms with the same microphone array.

• Still…
  • Some aspects are not modeled, e.g. head movements.
  • It is difficult to measure room impulse responses in public spaces, …
  • How should multiple microphone signals be generated?
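A minimal sketch of the simulation equation above (random stand-ins for the clean speech, the measured room impulse response, and the measured noise; the target SNR is an assumption):

```python
import numpy as np

fs = 16000
clean = np.random.randn(5 * fs)          # clean speech (stand-in)
rir = np.random.randn(int(0.5 * fs))     # measured room impulse response (stand-in)
noise = np.random.randn(5 * fs)          # measured noise, same room/array (stand-in)

reverberant = np.convolve(clean, rir)[:len(clean)]   # clean speech * RIR
snr_db = 10.0                                        # assumed target SNR
gain = np.sqrt((reverberant ** 2).mean()
               / ((noise ** 2).mean() * 10 ** (snr_db / 10)))
mic = reverberant + gain * noise                     # simulated microphone signal
```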


DSR data simulation

• Recent challenge results showed that:
  • Simulated data helps for acoustic model training.
  • There is no need for precise simulation.
  • Results on simulated data do not match results on real data when using an SE front-end.
    • SE models match simulated data better → causes overfitting.

→ Need to develop better simulation techniques.


Toolkits

• ASR research has a long history of community-developed toolkits and recipes.

• Toolkits and recipes are important to:
  • Lower the barrier of entry
  • Make results reproducible
  • Speed up progress in the field

• Recent distant/multi-microphone speech recognition recipes for the REVERB and CHiME challenges include state-of-the-art back-end technologies.

• Far fewer toolkits and recipes are available for SE technologies.

→ Community-based development of SE toolkits could contribute to faster innovation for multi-microphone speech recognition.

SignalGraph — deep learning for signal processing (Matlab): https://github.com/singaxiong/SignalGraph


Cross community

• Multi-microphone speech recognition research requires the combination of:
  • Multichannel SE front-end technologies
  • ASR back-end technologies

→ A cross-disciplinary area of research spanning speech enhancement, microphone arrays, ASR, …

• Recent challenges (CHiME, REVERB) and workshops (JSALT'15) have increased synergy between the communities by sharing:
  • Common tasks
  • Baseline systems
  • Knowledge
    • Edited book to appear: "New Era for Robust Speech Recognition: Exploiting Deep Learning," Springer (2017)


Thank you!


Acknowledgments


• We would like to thank our colleagues at NTT, MERL, MELCO, TUM, SRI, NTU and at the 2015 Jelinek summer workshop on Speech and Language technology (JSALT) for their direct or indirect contributions to the content of this tutorial.

• In particular we are deeply grateful to Tatsuya Kawahara (Kyoto Univ.), Vikramjit Mitra (SRI), Horacio E. Franco (SRI), Victor Abrash (SRI), Jonathan Le Roux (MERL), Takaaki Hori (MERL), Christian Huemmer (Erlangen Univ.), Tsubasa Ochiai (Doshisha Univ.), Shigeki Karita (NTT), Keisuke Kinoshita (NTT) and Tomohiro Nakatani (NTT) for their comments and suggestions on the content of the tutorial.


Questions?