issam bazzi, alex acero, and li deng microsoft research one microsoft way redmond, wa, usa 2003
DESCRIPTION
AN EXPECTATION MAXIMIZATION APPROACH FOR FORMANT TRACKING USING A PARAMETER-FREE NON-LINEAR PREDICTOR. Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003. Outline. Introduction The Model EM Training Format Tracking Experiment Results - PowerPoint PPT PresentationTRANSCRIPT
AN EXPECTATION MAXIMIZATION APPROACH FOR FORMANT TRACKING USING A
PARAMETER-FREE NON-LINEAR PREDICTOR
Issam Bazzi, Alex Acero, and Li DengMicrosoft ResearchOne Microsoft Way
Redmond, WA, USA2003
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 2/14
Outline
IntroductionThe ModelEM TrainingFormat TrackingExperiment ResultsConclusion
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 3/14
Introduction
Traditional methods use LPC or matching stored templates of spectral cross sections
In either case, formant tracking is error-prone due to not enough candidates or templates
This paper uses a predictor codebook of MFCC to present formant relationships
Also, this method explores the complete formant space, avoiding premature elimination in LPC or template matching
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 4/14
The Model
ot = F(xt) + rt
ot is observed MFCC coefficientsxt is vocal tract resonances (VTR) and corr
esponding bandwidthsF(xt) is the quantized frequency and band
width of formants, named predictor codebook
rt is the residual signal
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 5/14
Constructing F(x) All-pole model Assume there are I formants x = (F1, B1, F2, B2,……, FI, BI) Then use z-transfrom to get H(z):
Finally, each quantized VTR x can be transformed into a MFCC series F(x)
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 6/14
EM Training (1/2)
Use a single Gaussian to model rt
T frames utterance, θ is parameters (mean and covariance) of Gaussian
Assume formant values x are uniformly distributed, and can take any of C quantized values
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 7/14
EM Training (2/2)
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 8/14
Formant Tracking (1/2)
Frame-by-Frame Tracking Formants in each frame are estimated
independently One-to-one Mapping (MAP)
Minimum Mean Squared Error (MMSE)
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 9/14
Formant Tracking (2/2) Tracking with Continuity Constraints
First Order State Model: xt = xt-1 + wt
wt is modeled as a Gaussian with zero mean and diagonal Σw
MAP method below can be estimated using Viterbi search
MMSE is more much complex and this paper uses an approximate method to obtain, which is not well described here
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 10/14
Experiment Settings
Track 3 formants Frequencies are first mapped
on mel-scale then uniformly quantized
Bandwidths are simply uniformly quantized
F1 < F2 < F3, so totally 767500 entries in codebook
Gain = 1 MFCC is 12 dimension, witho
ut C0 20 utterances of one male sp
eaker are used for EM
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 11/14
Experiment Results, “they were what”
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 12/14
Experiment Results, with bandwidth
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 13/14
Experiment Results, residual
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 14/14
Conclusion
This method is totally unsupervised, needless of any labeling
Works well in unvoiced framesNo gross errorsMay be applied to speech
recognizing system