analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · motivation:...
TRANSCRIPT
Analysis-by-synthesisfor source separation and speech recognition
Michael I [email protected]
Brooklyn College (CUNY)
Joint work with Young Suk Cho and Arun Narayanan (Ohio State)
Columbia Neural Network Seminar SeriesSeptember 8, 2015
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 1 / 73
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 2 / 73
Motivation: need for noise robustness
Outline
1 Motivation: need for noise robustnessNeed for better mobile voice qualityNeed for noise robust automatic speech recognition (ASR)Main challenge
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 3 / 73
Motivation: need for noise robustness Need for better mobile voice quality
Outline
1 Motivation: need for noise robustnessNeed for better mobile voice qualityNeed for noise robust automatic speech recognition (ASR)Main challenge
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 4 / 73
Motivation: need for noise robustness Need for better mobile voice quality
Need for better mobile voice quality
There are now more mobile devices than humans on earth1
But recording conditions for these devices leave much to be desired
Can we recover high quality speech from noisy & degraded recordings?
1http://www.independent.co.uk/life-style/gadgets-and-tech/news/
there-are-officially-more-mobile-devices-than-people-in-the-world-9780518.html
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 5 / 73
Motivation: need for noise robustness Need for better mobile voice quality
Why mobile voice quality stinks2
2Je↵ Hecht. Why mobile voice quality still stinks—and how to fix it. IEEE Spectrum, September 2014
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 6 / 73
Motivation: need for noise robustness Need for better mobile voice quality
Why mobile voice quality stinks2
2Je↵ Hecht. Why mobile voice quality still stinks—and how to fix it. IEEE Spectrum, September 2014
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 6 / 73
Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)
Outline
1 Motivation: need for noise robustnessNeed for better mobile voice qualityNeed for noise robust automatic speech recognition (ASR)Main challenge
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 7 / 73
Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)
Conversational mobile software agents
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 8 / 73
Source: Tom Vanleenhove
Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)
Conversational mobile software agents need to work in
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 9 / 73
Source: Flickr user rickihuang
Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)
Conversational mobile software agents need to work in
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 9 / 73
Source: Flickr user retorta net
Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)
Conversational mobile software agents need to work in
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 9 / 73
Source: Flickr user Brian Indrelunas
Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)
But automatic speech recognition doesn’t work there3
3Amit Juneja. A comparison of automatic and human speech recognition in null grammar. The Journal of the Acoustical
Society of America, 131(3):EL256–EL261, February 2012
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 10 / 73
Motivation: need for noise robustness Main challenge
Outline
1 Motivation: need for noise robustnessNeed for better mobile voice qualityNeed for noise robust automatic speech recognition (ASR)Main challenge
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 11 / 73
Motivation: need for noise robustness Main challenge
Main challenge
Speech is a rich signal, it requires rich models
Synthesis models are rich enough to represent almost all speech
Non-parametric synthesis models for high qualityDNN as non-linear distance function
Parametric synthesis models for e�cient representatione�cient gradient-based optimization of input (not model)
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 12 / 73
Motivation: need for noise robustness Main challenge
Main challenge
Speech is a rich signal, it requires rich models
Synthesis models are rich enough to represent almost all speech
Non-parametric synthesis models for high qualityDNN as non-linear distance function
Parametric synthesis models for e�cient representatione�cient gradient-based optimization of input (not model)
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 12 / 73
Non-parametric synthesis for speech enhancement
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 13 / 73
Non-parametric synthesis for speech enhancement Overview
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 14 / 73
Non-parametric synthesis for speech enhancement Overview
Concatenative resynthesis for speech enhancement4,5
Standard approaches try to modify noisy recordings
We instead resynthesize a clean version of the same speech
Should produce infinite suppression and high speech quality
4Michael I Mandel, Young-Suk Cho, and Yuxuan Wang. Learning a concatenative resynthesis system for noise suppression.
In Proc. IEEE GlobalSIP, 20145Michael I Mandel and Young Suk Cho. Audio super-resolution using concatenative resynthesis. In Proc. IEEE WASPAA,
2015. To appear
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 15 / 73
Non-parametric synthesis for speech enhancement Overview
Motivating example
Your phone records your voice in quiet, close-talk conditions
Uses those recordings to replace your voice in noisy, far-talk conditions
Resynthesizes your speech from previous high-quality recordings
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 16 / 73
Non-parametric synthesis for speech enhancement Overview
Concatenative resynthesis
Use a large dictionary of ⇠200 ms “chunks” of audio
Learn DNN-based a�nity between dictionary & mixture chunks
Perform concatenative synthesis of signal from dictionary
General robust supervised nonlinear signal mapping framework
Task Map from To
Noise suppression Noisy CleanAudio super-resolution Reverberated, compressed Clean
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 17 / 73
Non-parametric synthesis for speech enhancement Deep neural network as nonlinear distance function
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 18 / 73
Non-parametric synthesis for speech enhancement Deep neural network as nonlinear distance function
Deep neural network as nonlinear distance function6
Generative Discriminative Dictionary-based
Data-intensive training Moderate training data Data-e�cient trainingHard to adapt Hard to adapt Very adaptable
6Michael I Mandel, Young-Suk Cho, and Yuxuan Wang. Learning a concatenative resynthesis system for noise suppression.
In Proc. IEEE GlobalSIP, 2014
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 19 / 73
Non-parametric synthesis for speech enhancement Deep neural network as nonlinear distance function
Train DNN on correctly and incorrectly paired chunks
Noise suppression
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 20 / 73
Non-parametric synthesis for speech enhancement Deep neural network as nonlinear distance function
Train DNN on correctly and incorrectly paired chunks
Audio super-resolution
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 21 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 22 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Find optimal sequence of clean chunks
x = {xt}Tt=0 input sequence of noisy chunks
z = {zt}Tt=0 best sequence of corresponding dictionary chunks
A�nity between clean and noisy chunks
Transition a�nity between clean chunks
z = argmaxz
Y
t
p(zt = j | xt) p(zt = j | zt�1 = i)
= argmaxz
Y
i
g(zj , xi ) Tij
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 23 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Find optimal sequence of clean chunks
x = {xt}Tt=0 input sequence of noisy chunks
z = {zt}Tt=0 best sequence of corresponding dictionary chunks
A�nity between clean and noisy chunks
Transition a�nity between clean chunks
z = argmaxz
Y
t
p(zt = j | xt) p(zt = j | zt�1 = i)
= argmaxz
Y
i
g(zj , xi ) Tij
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 23 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Find optimal sequence of clean chunks
x = {xt}Tt=0 input sequence of noisy chunks
z = {zt}Tt=0 best sequence of corresponding dictionary chunks
A�nity between clean and noisy chunks
Transition a�nity between clean chunks
z = argmaxz
Y
t
p(zt = j | xt) p(zt = j | zt�1 = i)
= argmaxz
Y
i
g(zj , xi ) Tij
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 23 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Compare all pairs of noisy and clean chunks
D1D2D3...
M1M1M1...
DNN
Observed mixture
Clean dictionary
Similarity
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Compare all pairs of noisy and clean chunks
D1D2D3...
M2M2M2...
DNN
Observed mixture
Clean dictionary
Similarity
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Compare all pairs of noisy and clean chunks
D1D2D3...
M3M3M3...
DNN
Observed mixture
Clean dictionary
Similarity
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Compare all pairs of noisy and clean chunks
D1D2D3...
M4M4M4...
DNN
Observed mixture
Clean dictionary
Similarity
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Compare all pairs of noisy and clean chunks
D1D2D3...
M5M5M5...
DNN
Observed mixture
Clean dictionary
Similarity
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Compare all pairs of noisy and clean chunks
D1D2D3...
MNMNMN...
DNN
Observed mixture
Clean dictionary
Similarity
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Standard Viterbi algorithm for to find optimal sequence
D1D2D3...
MNMNMN...
DNN
Observed mixture
Clean dictionary
Similarity
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 25 / 73
Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement
Standard Viterbi algorithm for to find optimal sequence
D1D2D3...
MNMNMN...
DNN
Observed mixture
Clean dictionary
Similarity
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 25 / 73
Non-parametric synthesis for speech enhancement Noise suppression experiments
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 26 / 73
Non-parametric synthesis for speech enhancement Noise suppression experiments
Original “clean” speech
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 27 / 73
Non-parametric synthesis for speech enhancement Noise suppression experiments
Noisy speech
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 28 / 73
Non-parametric synthesis for speech enhancement Noise suppression experiments
Traditional mask-based separation
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 29 / 73
Non-parametric synthesis for speech enhancement Noise suppression experiments
Concatenative resynthesis output
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 30 / 73
Non-parametric synthesis for speech enhancement Noise suppression experiments
Original “clean” speech
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 31 / 73
Non-parametric synthesis for speech enhancement Noise suppression experiments
Subjective quality is high
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 32 / 73
Non-parametric synthesis for speech enhancement Noise suppression experiments
Subjective quality is high
20 40 60 80 100
Noisy
IRM NN
Concat
Clean
Quality (higher better)
SpeechNoise SupOverall
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 32 / 73
Non-parametric synthesis for speech enhancement Noise suppression experiments
Subjective intelligibility is ok
60 70 80 90 100
Noisy
IRM NN
Concat
Clean
Words correctly identified (%)
Keywords
All words
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 33 / 73
Non-parametric synthesis for speech enhancement Audio super-resolution experiments
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 34 / 73
Non-parametric synthesis for speech enhancement Audio super-resolution experiments
Original clean speech
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 35 / 73
Non-parametric synthesis for speech enhancement Audio super-resolution experiments
Reverberated, compressed, 20% packet loss
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 36 / 73
Non-parametric synthesis for speech enhancement Audio super-resolution experiments
NMF-based bandwidth expansion output
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 37 / 73
Non-parametric synthesis for speech enhancement Audio super-resolution experiments
Concatenative resynthesis output
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 38 / 73
Non-parametric synthesis for speech enhancement Audio super-resolution experiments
Original clean speech
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 39 / 73
Non-parametric synthesis for speech enhancement Audio super-resolution experiments
Subjective quality is high
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 40 / 73
Non-parametric synthesis for speech enhancement Audio super-resolution experiments
Subjective quality is high
0
10
20
30
40
50
60
70
80
90
100
MU
SH
RA
Sco
re
CleanClean (hid)RevRev 8kHzRevOpusL20RevOpusL20 (hid)
CleanAmr RevAmr RevOpusL20
InputNMFConcat
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 40 / 73
Non-parametric synthesis for speech enhancement Audio super-resolution experiments
Subjective intelligibility is good
70
75
80
85
90
95
100
Corr
ect
word
s (%
)
CleanRevRev 8kHzRevOpusL20
CleanAmr RevAmr RevOpusL20
InputNMFConcat
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 41 / 73
Non-parametric synthesis for speech enhancement Summary
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 42 / 73
Non-parametric synthesis for speech enhancement Summary
Summary
Concatenative synthesizer, DNN as noise-robust selection function
Instead of modifying noisy speech, replace itcompletely eliminates noise, except for synthesis errorsproduces high quality, natural-sounding speech
General robust supervised nonlinear signal mapping framework
Data-e�cient to train and adaptable to new talkers
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 43 / 73
Non-parametric synthesis for speech enhancement Summary
Future applications
Generalize to audio-visual speech recognition
Label dictionary elements ahead of time to enablenoise-robust non-parametric speech recognitionnoise-robust pitch trackingnoise-robust speaker identification
Incorporate language model into transition cost
Develop e�cient search mechanisms for large-vocabulary dictionaries
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 44 / 73
Parametric synthesis for speech recognition
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognitionOverviewAlgorithmResultsSummary
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 45 / 73
Parametric synthesis for speech recognition Overview
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognitionOverviewAlgorithmResultsSummary
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 46 / 73
Parametric synthesis for speech recognition Overview
Mask-based source separation: Noisy
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 47 / 73
Parametric synthesis for speech recognition Overview
Mask-based source separation: Masked
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 48 / 73
Parametric synthesis for speech recognition Overview
Disrupts speech features: Noisy MFCCs
“He said such products would be marketed by othercompanies with experience him at this month.”
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 49 / 73
Parametric synthesis for speech recognition Overview
Disrupts speech features: Masked MFCCs
“He said such products would be marketed by othercompanies with experience him at this month.”
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 50 / 73
Parametric synthesis for speech recognition Overview
Disrupts speech features: Clean MFCCs
“He said such products would be marketed by othercompanies with experience in that business.”
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 51 / 73
Parametric synthesis for speech recognition Overview
Estimate better features using a strong prior model
“He said such products would be marketed by othercompanies with experience in that business.”
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 52 / 73
Parametric synthesis for speech recognition Overview
Our approach: Analysis-by-synthesis
Synthesize speech signal so that itlooks like the observationlooks like speech
Itakura-Saito divergence compares prediction with noisy observation
Recognizer gives likelihood of speech-ness
Both easy to optimize using gradient descent
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 53 / 73
Parametric synthesis for speech recognition Overview
Speech recognizer includes lots of information
Large vocabulary continuous speech recognizer captures:
Acoustics of speech sounds
The e↵ect of neighboring speech sounds
Pronunciation of words
Order of words
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 54 / 73
Parametric synthesis for speech recognition Algorithm
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognitionOverviewAlgorithmResultsSummary
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 55 / 73
Parametric synthesis for speech recognition Algorithm
Optimization over speech features
x: optimization state: MFCCs, ⇠10,000 dimensions
y(x): ASR features derived from x
M: mask provided a priori by another source separator
minx
L(x;M) = minx
n
(1� ↵) LI (x;M) + ↵ LH(y(x))o
Total cost
Distance to noisy observation
Negative log likelihood under recognizer
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 56 / 73
Parametric synthesis for speech recognition Algorithm
Optimization over speech features
x: optimization state: MFCCs, ⇠10,000 dimensions
y(x): ASR features derived from x
M: mask provided a priori by another source separator
minx
L(x;M) = minx
n
(1� ↵) LI (x;M) + ↵ LH(y(x))o
Total cost
Distance to noisy observation
Negative log likelihood under recognizer
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 56 / 73
Parametric synthesis for speech recognition Algorithm
Optimization over speech features
x: optimization state: MFCCs, ⇠10,000 dimensions
y(x): ASR features derived from x
M: mask provided a priori by another source separator
minx
L(x;M) = minx
n
(1� ↵) LI (x;M) + ↵ LH(y(x))o
Total cost
Distance to noisy observation
Negative log likelihood under recognizer
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 56 / 73
Parametric synthesis for speech recognition Algorithm
Analysis of audio meets resynthesis of MFCCs at mask
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 57 / 73
Parametric synthesis for speech recognition Algorithm
LI (x;M): Distance to noisy observation
Resynthesize MFCCs to power spectrum, where mask was computed
Do mask-aware comparison in that domain: weighted Itakura-Saitobetween resynthesis, S!t(x), and noisy observation, Sweighted by mask, M
LI (x;M) = DM(S k S) =X
!,t
M!t
✓
S!t
S!t(x)� log
S!t
S!t(x)� 1
◆
Does not require modeling speech excitation
Numerically di↵erentiable with respect to x
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 58 / 73
Parametric synthesis for speech recognition Algorithm
LH(y(x)): Likelihood under recognizer
Large vocabulary continuous speech recognizerbig hidden Markov model (HMM)approximated by the lattice of likely paths
Closed form gradient with respect to x
Serves as a model of clean MFCC sequences
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 59 / 73
Parametric synthesis for speech recognition Algorithm
LH(y(x)): Likelihood under recognizer
Large vocabulary continuous speech recognizerbig hidden Markov model (HMM)approximated by the lattice of likely paths
Closed form gradient with respect to x
Serves as a model of clean MFCC sequences
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 59 / 73
Parametric synthesis for speech recognition Algorithm
Optimization
State space of approximately 13⇥ 800 ⇡10,000 dimensions
Quasi-Newton optimization, BFGSgradient plus approximate second-order information
Closed form gradient of HMM likelihoodusing a forward-backward algorithm
Numerical gradient of IS divergenceindependent costs and gradients for each frame
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 60 / 73
Parametric synthesis for speech recognition Results
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognitionOverviewAlgorithmResultsSummary
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 61 / 73
Parametric synthesis for speech recognition Results
Experiment
AURORA4 corpusread Wall Street Journal sentences (5000 word vocabulary)six environmental noise typesSNRs between 5 and 15 dB
Masks from ideal binary mask and estimated ratio mask7
7Arun Narayanan and DeLiang Wang. Ideal ratio mask estimation using deep neural networks for robust speech
recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 7092–7096.IEEE, May 2013
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 62 / 73
Parametric synthesis for speech recognition Results
Recognition results
Word error rate (%) averaged across noise type
Mask Direct A-by-S
Noisy 30.94Estimated 16.18 15.31Oracle 14.38 13.62Clean 9.54
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 63 / 73
Parametric synthesis for speech recognition Results
Reconstruction results
Itakura-Saito divergence between resynthesized speech and original
Mask Direct A-by-S �
Noisy 272301Estimated 276497 275224 �1273Oracle 273006 272506 �500
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 64 / 73
Parametric synthesis for speech recognition Results
Resynthesis gets closer to reliable regions
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 65 / 73
Parametric synthesis for speech recognition Results
Resynthesis gets closer to reliable regions
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 66 / 73
Parametric synthesis for speech recognition Results
Resynthesis gets closer to reliable regions
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 67 / 73
Parametric synthesis for speech recognition Results
Resynthesis gets closer to reliable regions
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 68 / 73
Parametric synthesis for speech recognition Summary
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognitionOverviewAlgorithmResultsSummary
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 69 / 73
Parametric synthesis for speech recognition Summary
Summary
Use a full recognizer as a prior model for clean speech
Synthesize from MFCCs to the domain of the mask
Adjust synthesis of speech signal so that itlooks like the observationlooks like speech
Reduces recognition errors, distance to clean utterance
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 70 / 73
Parametric synthesis for speech recognition Summary
Future directions
Apply to DNN-based acoustic models
Model speech excitation for full resynthesis of clean speech
Model multiple simultaneous speakers and estimate masks jointly
Combine with similar binaural model to include spatial clustering
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 71 / 73
Summary
Outline
1 Motivation: need for noise robustness
2 Non-parametric synthesis for speech enhancement
3 Parametric synthesis for speech recognition
4 Summary
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 72 / 73
Summary
Summary
Synthesizers provide strong prior information
Non-parametric synthesis models for high qualitylearned nonlinear matching function for perceptually motivated features
Parametric synthesis models for e�cient representationstrong, di↵erentiable prior model of speech
Thanks!
Any questions?
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 73 / 73
Summary
Summary
Synthesizers provide strong prior information
Non-parametric synthesis models for high qualitylearned nonlinear matching function for perceptually motivated features
Parametric synthesis models for e�cient representationstrong, di↵erentiable prior model of speech
Thanks!
Any questions?
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 73 / 73
Summary
Summary
Synthesizers provide strong prior information
Non-parametric synthesis models for high qualitylearned nonlinear matching function for perceptually motivated features
Parametric synthesis models for e�cient representationstrong, di↵erentiable prior model of speech
Thanks!
Any questions?
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 73 / 73
Parametric synthesis for separation
Outline
5 Parametric synthesis for separation
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 1 / 3
Parametric synthesis for separation
Re-estimate mask using resynthesis: Original
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 2 / 3
Parametric synthesis for separation
Re-estimate mask using resynthesis: Re-estimate
Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 3 / 3