improvement of probabilistic acoustic tube model for...

1
IMPROVEMENT OF PROBABILISTIC ACOUSTIC TUBE MODEL FOR SPEECH DECOMPOSITION Yang Zhang 1 [email protected] Zhijian Ou 2 [email protected] Mark Hasegawa-Johnson 1 [email protected] 1 Uni versity of Illinois at Urbana Champaign, Urbana IL, USA Department of Electrical and Computer Engineering 2 Tsinghua University, Beijing, China Department of Electronic Engineering Motivation Drawbacks of Current Model-based Methods Highlights of PAT2 Incomplete - tend to model only a part of parameters of interest, and disregard others that might also be important. Speech analysis may be inaccurate or even incorrect : Chicken and egg effect; LPC and MFCC corrupted by spectral tilt. A probabilistic generative model that jointly considers all speech parameters; Incorporates breathiness and glottal vibration; Incorporates phase modeling and so completely defines a probabilistic model for the complex spectrum of speech; Makes U/V states a continuum by introducing voiced amplitude and unvoiced amplitude, which is closer to the nature of speech. PAT2 Signal Modeling The Source Filter Model with Mixed Excitation Glottal Filter Vocal Tract Filter Impulse train Glottal filter Vocal tract filter Voiced speech Unvoiced Excitation Unvoiced Speech usually dominates =0 voiced excitation = + = 0 = 1+ 1 cos + 1 2 −2 −1 1+ 2 −1 magnitude & phase of the anti-causal poles = exp =1 exp mel-frequency complex cepstral coefficients When negative sign and group delay is removed, complex cepstrum decay at the rate of 1/ . Experimental Results Voiced Reconstruction Voiced Reconstruction – Single Frame GCI Location Estimation Vocal Tract Filter Estimation Voiced vs Whispered Original Spectrogram Voiced Reconstruct Real Spectrum Imaginary Spectrum PAT2 MFCC PAT2 MFCC PAT2 Probabilistic Modeling Convert to DFT and Vectorize Convert to Mel Frequency Add Prior = + ~ 0, = = + ~ , 2 | −1 | ∝− 2 2 Analysis with MAP unvoiced excitation dirac delta function vocal tract transfer function windowing function glottal transfer function group delay speech magnitude of the causal pole quefrency mel-frequency fundamental frequency

Upload: others

Post on 18-Nov-2019

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IMPROVEMENT OF PROBABILISTIC ACOUSTIC TUBE MODEL FOR ...oa.ee.tsinghua.edu.cn/~ouzhijian/pdf/icassp2014_pat2_poster2.pdf · Drawbacks of Current Model-based Methods Highlights of

IMPROVEMENT OF PROBABILISTIC ACOUSTIC TUBE MODEL FOR SPEECH DECOMPOSITION

Yang Zhang1

[email protected] Ou2

[email protected] Hasegawa-Johnson1

[email protected]

1 University of Illinois at Urbana Champaign, Urbana IL, USADepartment of Electrical and Computer Engineering

2 Tsinghua University, Beijing, ChinaDepartment of Electronic Engineering

MotivationDrawbacks of Current Model-based Methods Highlights of PAT2

• Incomplete - tend to model only a part of parameters of interest,and disregard others that might also be important.

• Speech analysis may be inaccurate or even incorrect:• Chicken and egg effect;• LPC and MFCC corrupted by spectral tilt.

• A probabilistic generative model that jointly considers all speech parameters;

• Incorporates breathiness and glottal vibration;• Incorporates phase modeling and so completely defines a

probabilistic model for the complex spectrum of speech;• Makes U/V states a continuum by introducing voiced amplitude and

unvoiced amplitude, which is closer to the nature of speech.

PAT2 Signal ModelingThe Source Filter Model with Mixed Excitation

Glottal Filter Vocal Tract Filter

Impulse train

Glottal filter

Vocal tract filter

Voiced speech

Unvoiced Excitation

Unvoiced Speech

𝑎𝑡 usually dominates 𝑏𝑡

𝑏𝑡

𝑎𝑡

𝑎𝑡 = 0

voiced excitation

𝑆𝑡 𝜔 = 𝑎𝑡𝑉𝑡 𝜔 + 𝑏𝑡𝑈𝑡 𝜔 𝐻𝑡 𝜔 ⊛ 𝑊𝑡 𝜔

𝑉𝑡 𝜔 = 𝐺𝑡 𝜔 𝑒−𝑗𝜔𝜏𝑡

𝑘

𝛿 𝜔 − 𝑘𝜔0𝑡

𝐺𝑡 𝜔 = 1 + 𝑔1𝑡 cos 𝛽𝑡𝑒−𝑗𝑤 + 𝑔1𝑡

2 𝑒−2𝑗𝜔 −1

1 + 𝑔2𝑡𝑒−𝑗𝜔 −1

magnitude & phase of the anti-causal poles

𝐻𝑡 𝜔 = exp

𝑛=1

𝐾

ℎ 𝑛 exp −𝑗𝑚 𝜔 𝑛

mel-frequency complex cepstral coefficientsWhen negative sign and group delay is removed, complex cepstrum decay at the rate of 1/ 𝑛.

Experimental ResultsVoiced Reconstruction Voiced Reconstruction – Single Frame GCI Location Estimation

Vocal Tract Filter Estimation Voiced vs Whispered

Original Spectrogram

Voiced Reconstruct

Real Spectrum

Imaginary Spectrum

PAT2 MFCC PAT2 MFCC

PAT2 Probabilistic ModelingConvert to DFT and Vectorize Convert to Mel Frequency Add Prior

𝒔𝑡 = 𝑎𝑡𝝃𝑡 + 𝑏𝑡𝜼𝑡

𝜼𝑡~𝒩 0,𝑯𝑡

𝒔𝑡 = 𝑭𝒔𝑡 = 𝑎𝑡𝑭𝝃𝑡 + 𝑏𝑡𝑭𝜼𝑡

𝒔𝑡~𝒩 𝑎𝑡𝑭𝝃𝑡, 𝑏𝑡2𝑭𝑯𝑡𝑭

𝑇

𝑃𝜃𝑡|𝜃𝑡−1𝑢|𝑣 ∝ −

𝑢 − 𝑣 2

𝜎𝜃2

Analysis with MAP

unvoiced excitation

dirac delta function

vocal tract transfer function

windowing function

glottal transfer function

group delay

speech

magnitude of the causal pole quefrency mel-frequency

𝐻𝑡

𝑉𝑡

𝑈𝑡 fundamental frequency