issam bazzi, alex acero, and li deng microsoft research one microsoft way redmond, wa, usa 2003

14
AN EXPECTATION MAXIMIZATION APPRO ACH FOR FORMANT TRACKING USING A PARAMETER-FREE NON-LINEAR PREDICT OR Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Upload: quinn-nielsen

Post on 30-Dec-2015

23 views

Category:

Documents


0 download

DESCRIPTION

AN EXPECTATION MAXIMIZATION APPROACH FOR FORMANT TRACKING USING A PARAMETER-FREE NON-LINEAR PREDICTOR. Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003. Outline. Introduction The Model EM Training Format Tracking Experiment Results - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

AN EXPECTATION MAXIMIZATION APPROACH FOR FORMANT TRACKING USING A

PARAMETER-FREE NON-LINEAR PREDICTOR

Issam Bazzi, Alex Acero, and Li DengMicrosoft ResearchOne Microsoft Way

Redmond, WA, USA2003

Page 2: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 2/14

Outline

IntroductionThe ModelEM TrainingFormat TrackingExperiment ResultsConclusion

Page 3: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 3/14

Introduction

Traditional methods use LPC or matching stored templates of spectral cross sections

In either case, formant tracking is error-prone due to not enough candidates or templates

This paper uses a predictor codebook of MFCC to present formant relationships

Also, this method explores the complete formant space, avoiding premature elimination in LPC or template matching

Page 4: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 4/14

The Model

ot = F(xt) + rt

ot is observed MFCC coefficientsxt is vocal tract resonances (VTR) and corr

esponding bandwidthsF(xt) is the quantized frequency and band

width of formants, named predictor codebook

rt is the residual signal

Page 5: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 5/14

Constructing F(x) All-pole model Assume there are I formants x = (F1, B1, F2, B2,……, FI, BI) Then use z-transfrom to get H(z):

Finally, each quantized VTR x can be transformed into a MFCC series F(x)

Page 6: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 6/14

EM Training (1/2)

Use a single Gaussian to model rt

T frames utterance, θ is parameters (mean and covariance) of Gaussian

Assume formant values x are uniformly distributed, and can take any of C quantized values

Page 7: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 7/14

EM Training (2/2)

Page 8: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 8/14

Formant Tracking (1/2)

Frame-by-Frame Tracking Formants in each frame are estimated

independently One-to-one Mapping (MAP)

Minimum Mean Squared Error (MMSE)

Page 9: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 9/14

Formant Tracking (2/2) Tracking with Continuity Constraints

First Order State Model: xt = xt-1 + wt

wt is modeled as a Gaussian with zero mean and diagonal Σw

MAP method below can be estimated using Viterbi search

MMSE is more much complex and this paper uses an approximate method to obtain, which is not well described here

Page 10: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 10/14

Experiment Settings

Track 3 formants Frequencies are first mapped

on mel-scale then uniformly quantized

Bandwidths are simply uniformly quantized

F1 < F2 < F3, so totally 767500 entries in codebook

Gain = 1 MFCC is 12 dimension, witho

ut C0 20 utterances of one male sp

eaker are used for EM

Page 11: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 11/14

Experiment Results, “they were what”

Page 12: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 12/14

Experiment Results, with bandwidth

Page 13: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 13/14

Experiment Results, residual

Page 14: Issam Bazzi, Alex Acero, and Li Deng Microsoft Research One Microsoft Way Redmond, WA, USA 2003

Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007 14/14

Conclusion

This method is totally unsupervised, needless of any labeling

Works well in unvoiced framesNo gross errorsMay be applied to speech

recognizing system