Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis

João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi

The Centre for Speech Technology Research, The University of Edinburgh

Outline

• Introduction

• Voice source model

• System

• Perceptual evaluation

• Concluding remarks

• Future work

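Introduction: HMM-based speech synthesizer [Tokuda et al.]

[Figure: block diagram of the HMM-based speech synthesizer. Analysis blocks: Training speech, F0 extraction, Spectral features estimation, HMMs. Synthesis blocks: Pulse train, Noise component, Synthesis filter (spectrum), Synthetic speech.]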

Voice source model: Obtaining the glottal source signal

• Source-filter model: Source → Vocal tract → Lip radiation (d/dz) → Speech

• Inverse filtering: Speech → Inverse filter (1/A(z)) → Lip radiation cancellation (∫) → Glottal source signal
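
The two chains above can be sketched as a round trip on a toy signal. This is a minimal illustration, not the analysis used in the system: it assumes an all-pole vocal tract described by a hypothetical two-coefficient polynomial `lpc` (a naming choice made here, which may not match the slide's A(z) notation), a first-order difference for lip radiation, and a running sum for its cancellation.

```python
def all_pole_filter(x, lpc):
    """Vocal tract: y[n] = x[n] - sum_k lpc[k-1] * y[n-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * y[n - k]
        y.append(acc)
    return y


def fir_filter(x, lpc):
    """Inverse of all_pole_filter: y[n] = x[n] + sum_k lpc[k-1] * x[n-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc += a * x[n - k]
        y.append(acc)
    return y


def first_difference(x):
    """Lip radiation modeled as a first-order differentiator."""
    return [xn - (x[n - 1] if n > 0 else 0.0) for n, xn in enumerate(x)]


def running_sum(x):
    """Lip radiation cancellation: undoes first_difference."""
    total, y = 0.0, []
    for xn in x:
        total += xn
        y.append(total)
    return y


# Hypothetical vocal tract coefficients and a toy source signal.
lpc = [-0.9, 0.4]
source = [0.0, 1.0, 0.5, -0.25, -1.0, 0.0, 0.3, 0.0]

# Source-filter model: source -> vocal tract -> lip radiation -> speech.
speech = first_difference(all_pole_filter(source, lpc))

# Inverse filtering: speech -> inverse filter -> lip radiation cancellation.
recovered = running_sum(fir_filter(speech, lpc))
```

With known coefficients and zero initial conditions, `recovered` matches `source` up to rounding. In practice the coefficients have to be estimated from the speech itself, so the recovery is only approximate.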

Voice source model: Liljencrants-Fant model (LF-model)

T : period

to : opening instant

tp : instant of max airflow

te : instant of max excitation

ta : return phase duration

tc : closing instant

Ee : excitation amplitude
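
As an illustration of how these parameters shape one period, the sketch below samples the LF-model in its commonly published two-segment form (an exponentially growing sinusoid up to te, then an exponential return phase up to tc). It is a sketch under assumptions, not the authors' implementation: the opening instant is taken as to = 0, the function name `lf_waveform` and all numeric values are hypothetical, and the growth factor `alpha` is found by a simple bisection on the zero-net-flow condition.

```python
import math


def lf_waveform(T, tp, te, ta, tc, Ee, fs):
    """Sample one period of the LF-model (glottal flow derivative).

    Times are in seconds from the opening instant (to = 0 is assumed).
    Requires tp < te < 2 * tp and ta much shorter than tc - te.
    """
    wg = math.pi / tp          # the flow peaks at tp
    tb = tc - te               # length of the return phase

    # Solve eps * ta = 1 - exp(-eps * tb) by fixed-point iteration.
    eps = 1.0 / ta
    for _ in range(100):
        eps = (1.0 - math.exp(-eps * tb)) / ta

    def sample(t, alpha):
        if t <= te:
            # Open phase, scaled so that the value at te is -Ee.
            return (-Ee * math.exp(alpha * (t - te))
                    * math.sin(wg * t) / math.sin(wg * te))
        if t <= tc:
            # Return phase, rising from -Ee back to zero at tc.
            return -(Ee / (eps * ta)) * (math.exp(-eps * (t - te))
                                         - math.exp(-eps * tb))
        return 0.0

    def area(alpha, steps=2000):
        # Trapezoidal integral of the waveform over [0, tc].
        h = tc / steps
        total = 0.5 * (sample(0.0, alpha) + sample(tc, alpha))
        total += sum(sample(i * h, alpha) for i in range(1, steps))
        return total * h

    # Bisection on alpha so that the net area over the period is zero.
    lo, hi = -50.0 / te, 50.0 / te
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if area(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)

    n_samples = int(round(T * fs))
    return [sample(i / fs, alpha) for i in range(n_samples)], alpha


# Hypothetical parameter values for a 100 Hz period sampled at 16 kHz.
waveform, alpha = lf_waveform(T=0.01, tp=0.0045, te=0.006, ta=0.0003,
                              tc=0.01, Ee=1.0, fs=16000)
```

The most negative sample falls at te with value close to -Ee, which is why te is called the instant of maximum excitation.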

Voice source model: Other parameters of the LF-model

Open quotient: OQ, expressed in terms of te and ta [equation not recoverable]

Return quotient: [equation not recoverable]

Voice source model: Description of the LF-model spectrum

Linear stylization of the LF-model spectrum [Doval and d’Alessandro]

Fg : glottal spectral peak

Fc : spectral tilt

Voice source model: Feature extraction

• utterances sampled at 16 kHz

• pitch-synchronous analysis (ESPS tools)

• LPCs calculated with 20 ms windows centered at the glottal epochs

• inverse filtering to estimate DGS

• pre-emphasis filter (α=0.97)

• low-pass filtering of the residual at 4 kHz
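
The LPC and inverse-filtering steps listed above can be sketched in plain Python. This is an illustration under assumptions rather than the ESPS-based analysis itself: the function names are hypothetical, the LPCs are estimated with the autocorrelation method on the pre-emphasized frame (the slide does not say where pre-emphasis is applied), and windowing and the 4 kHz low-pass filter are omitted.

```python
def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]."""
    return [x[n] - alpha * (x[n - 1] if n > 0 else 0.0)
            for n in range(len(x))]


def autocorrelation(x, max_lag):
    """r[lag] = sum_n x[n] * x[n - lag], for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return prediction-error filter coefficients [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a


def inverse_filter(x, a):
    """Apply the prediction-error filter 1 + a1 z^-1 + ... + ap z^-p."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


def estimate_dgs(frame, order):
    """Assumed ordering: LPCs from the pre-emphasized frame,
    inverse filter applied to the original frame."""
    emphasized = pre_emphasis(frame)
    a = levinson_durbin(autocorrelation(emphasized, order), order)
    return inverse_filter(frame, a), a


# Sanity check on a toy all-pole impulse response (hypothetical values).
true_a = [1.0, -0.9, 0.4]
impulse_response = []
for n in range(200):
    v = 1.0 if n == 0 else 0.0
    if n >= 1:
        v -= true_a[1] * impulse_response[n - 1]
    if n >= 2:
        v -= true_a[2] * impulse_response[n - 2]
    impulse_response.append(v)

a_est = levinson_durbin(autocorrelation(impulse_response, 2), 2)
residual = inverse_filter(impulse_response, a_est)
```

On this toy signal the estimated coefficients match `true_a` and the residual collapses back to the single excitation impulse. Real speech frames give only an approximation of the glottal source derivative.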

Voice source model: Estimation of te and Ee

te and Ee are estimated from the pitch-marks
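
One simple way to read this step is sketched below, assuming that each pitch-mark lies close to the main negative peak of the inverse-filtered signal (DGS). The function name `estimate_te_ee` and the `search_radius` parameter are hypothetical and not taken from the slides.

```python
def estimate_te_ee(dgs, pitch_marks, search_radius=8):
    """For each pitch-mark (sample index), locate the most negative DGS
    sample within +/- search_radius samples.

    Returns a list of (te_index, Ee) pairs, with Ee as a positive amplitude.
    """
    results = []
    for mark in pitch_marks:
        start = max(0, mark - search_radius)
        stop = min(len(dgs), mark + search_radius + 1)
        te_index = min(range(start, stop), key=lambda n: dgs[n])
        results.append((te_index, -dgs[te_index]))
    return results


# Toy example: the pitch-mark at sample 3 is refined to the peak at sample 4.
toy_dgs = [0.0, 0.2, 0.1, -0.5, -1.0, -0.3, 0.0]
estimates = estimate_te_ee(toy_dgs, [3], search_radius=2)
```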

Voice source model: Estimation of tc, tp and to

tp : instant of the maximum of U (Umax)

tc : instant of the minimum of U (Umin)

[Gobl & Chasaide]
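
A minimal sketch of a direct max/min search is given here, assuming U is obtained by integrating the DGS over one cycle and that the cycle starts near the opening instant. The function names are hypothetical, and restricting the minimum to the samples after the peak is a choice made for this example; estimation of to is not covered.

```python
def glottal_flow(dgs_cycle, fs):
    """Integrate the DGS (running sum divided by fs) to obtain U."""
    total, flow = 0.0, []
    for value in dgs_cycle:
        total += value / fs
        flow.append(total)
    return flow


def estimate_tp_tc(dgs_cycle, fs):
    """Return (tp, tc) in seconds from the start of the cycle.

    tp is the instant of the flow maximum; tc is the instant of the
    first flow minimum after that maximum.
    """
    flow = glottal_flow(dgs_cycle, fs)
    p = max(range(len(flow)), key=lambda n: flow[n])
    c = min(range(p, len(flow)), key=lambda n: flow[n])
    return p / fs, c / fs


# Toy cycle at fs = 1 Hz, so the results are sample indices.
toy_cycle = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, 0.0, 0.0]
tp_est, tc_est = estimate_tp_tc(toy_cycle, fs=1.0)
```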

Voice source model: Estimation of ta

Fs : sampling frequency

m : slope of the tangent at t=te
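
The estimation formula itself is not shown in the text. One reading consistent with the two symbols listed, stated here as an assumption, follows the usual geometric definition of ta: the time the tangent to the return phase at te takes to rise from -Ee back to zero. If m is measured in amplitude units per sample, this gives ta = Ee / (m · Fs) in seconds. The helper names below are hypothetical.

```python
def tangent_slope(dgs, te_index):
    """Slope at te, in amplitude units per sample (forward difference)."""
    return dgs[te_index + 1] - dgs[te_index]


def estimate_ta(Ee, m, Fs):
    """ta in seconds, assuming m is the per-sample slope at t = te."""
    return Ee / (m * Fs)


# Toy values: Ee = 1.0, slope of 0.25 per sample, 16 kHz sampling.
m_example = tangent_slope([-1.0, -0.75], 0)
ta_example = estimate_ta(1.0, m_example, 16000)
```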

Voice source model: Examples of the estimated parameters

[Figure: curves of the LF-parameters for 2 voiced regions of an utterance]

System: General description

- Nitech-HTS 2005 system

- STRAIGHT method for analysis and synthesis

- mixed multi-band excitation with phase manipulation / pulse train

- Mel Log Spectrum Approximation (MLSA) filter

How was the LF-model integrated in the synthesizer?

System: Generation of the periodic excitation (pulse signal)

• Pulse centered within the frame

• multiplied by asymmetric windows

• summed with Gaussian noise

System: Periodic excitation with the LF-model

• 2 LF-waveforms centered at the instant te

• multiplied by asymmetric windows

• summed with Gaussian noise

System: Technical problem

Problem: the synthesis filter assumes the excitation to have a flat spectrum like the pulse train

Solution: Post-filter

Linear phase FIR filter:

-6 dB/dec for 1 Hz ≤ f ≤ Fg (Hz)

+6 dB/dec for Fg < f ≤ Fc (Hz)

+12 dB/dec for Fc < f ≤ 16 kHz
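
The piecewise slopes can be turned into a filter in several ways; the sketch below uses a frequency-sampling design, which is a choice made for this example and not necessarily the method used in the system. The gain is taken as 0 dB at 1 Hz, the slopes are applied per decade as written, and the function names, tap count and the Fg, Fc and fs values used for testing are hypothetical.

```python
import math


def target_gain_db(f, Fg, Fc):
    """Piecewise-linear gain (in dB) over log-frequency, 0 dB at 1 Hz."""
    f = max(f, 1.0)                          # clamp DC to the 1 Hz point
    if f <= Fg:
        return -6.0 * math.log10(f)
    gain_at_fg = -6.0 * math.log10(Fg)
    if f <= Fc:
        return gain_at_fg + 6.0 * math.log10(f / Fg)
    gain_at_fc = gain_at_fg + 6.0 * math.log10(Fc / Fg)
    return gain_at_fc + 12.0 * math.log10(f / Fc)


def design_linear_phase_fir(Fg, Fc, fs, num_taps=255):
    """Frequency-sampling design of a symmetric FIR filter.

    num_taps must be odd. The magnitude response equals the target
    exactly at the frequencies k * fs / num_taps and is interpolated
    in between; the symmetry of the taps gives linear phase.
    """
    half = (num_taps - 1) // 2
    amps = [10.0 ** (target_gain_db(k * fs / num_taps, Fg, Fc) / 20.0)
            for k in range(half + 1)]
    taps = []
    for n in range(num_taps):
        acc = amps[0]
        for k in range(1, half + 1):
            acc += 2.0 * amps[k] * math.cos(
                2.0 * math.pi * k * (n - half) / num_taps)
        taps.append(acc / num_taps)
    return taps
```

A longer filter samples the target curve more densely, which matters most below Fg where the curve changes quickly on a linear frequency axis.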

System: Effect of the post-filtering

Perceptual evaluation: Generation of the stimuli

• Built the US-English voice EM001, provided by ATR for the Blizzard Challenge

• Glottal parameters were measured in 8 utterances and the mean values were calculated

• Simple excitation, without multi-band noise or phase manipulation

• Ten utterances were synthesized, using the LF-model and the pulse model

Perceptual evaluation: Experiment

• Forced-choice test

• Presented via a web browser interface

• Subjects were asked whether they used headphones or speakers, and whether they were native speakers (U.K./U.S.)

• 18 listeners (7 native speakers of English)

• The listener panel consisted mainly of university students and staff

Example of test speech signals: pulse train and LF-model [audio samples]

Perceptual evaluation: Results

Preference by excitation type:

Listeners              LF-model        Pulse train
Non-native speakers    61%             39%
Native speakers        68.6%           31.4%
Total (95% CI)         64% ± 6.7%      36% ± 6.7%
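
For readers who want to see how a 95% confidence interval on a preference score can be obtained, here is a small sketch using the normal (Wald) approximation for a proportion. The slides do not state which method or how many judgments were used, so both are assumptions here: `n_judgments = 18 * 10` supposes that each of the 18 listeners judged each of the ten utterances once.

```python
import math


def wald_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Hypothetical count: 18 listeners x 10 utterance pairs.
n_judgments = 18 * 10
half_width = wald_half_width(0.64, n_judgments)
```

Under these assumptions the half-width is about 7.0%, close to but not identical to the reported ± 6.7%, so the original analysis may have used a different count or method.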

Conclusions

• Nitech-HTS 2005 speech synthesizer was implemented with the LF-model for the voice source

• Results showed that the LF-model can give better speech quality than the traditionally used pulse train

• Direct methods used for the estimation of the mean LF-parameters seemed to perform well

• A technical problem with the integration of the LF-model in the system was solved using a post-filter

Future work

• To find better analysis/synthesis methods to use with the LF-model in HMM-based speech synthesis

• To evaluate the speech quality when using the mixed excitation with the LF-model

• To implement voice quality transformations using the LF-model

• To evaluate the parameterization methods

• To model the glottal parameters with HMMs

Acknowledgements

This work was financially supported by the Marie Curie EdSST programme.

Thank you!
