![Page 1: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/1.jpg)
Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis
João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi
The Centre for Speech Technology Research, The University of Edinburgh
![Page 2: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/2.jpg)
Outline
• Introduction
• Voice source model
• System
• Perceptual evaluation
• Concluding remarks
• Future work
![Page 3: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/3.jpg)
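Introduction: HMM-based speech synthesizer [Tokuda et al.]
[Figure: block diagram of the HMM-based synthesizer. Training part: training speech, F0 extraction, spectral features estimation, HMMs. Synthesis part: text, text analysis, HMMs, generated F0 and spectrum, pulse train + noise component, synthesis filter, synthetic speech.]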
![Page 4: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/4.jpg)
Voice source model: Obtaining the glottal source signal
• Source-filter model: Source ($U_g$) → Vocal tract ($A(z)$) → Lip radiation ($d/dz$) → Speech
• Inverse filtering: Speech → Inverse filter ($1/A(z)$) → $d\hat{U}_g$ → Lip radiation cancellation ($\int$) → $\hat{U}_g$
![Page 5: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/5.jpg)
Voice source model: Liljencrants-Fant model (LF-model)
T : period
to : opening instant
tp : instant of max airflow
te : instant of max excitation
ta : return phase duration
tc : closing instant
Ee : excitation amplitude
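The slide shows the LF waveform only as a figure. For reference, the standard two-segment formulation of the LF-model (Fant, Liljencrants and Lin) can be written with the parameters listed above. The symbols $E_0$, $\alpha$, $\varepsilon$ and $\omega_g$ do not appear on the slides and are introduced here as assumptions, with time measured from the opening instant ($t_o = 0$):

```latex
E(t) =
\begin{cases}
E_0 \, e^{\alpha t} \sin(\omega_g t), & 0 \le t \le t_e \\[6pt]
-\dfrac{E_e}{\varepsilon t_a}
\left[ e^{-\varepsilon (t - t_e)} - e^{-\varepsilon (t_c - t_e)} \right],
& t_e < t \le t_c
\end{cases}
\qquad \omega_g = \frac{\pi}{t_p}

% Constraints that determine the remaining parameters:
E(t_e) = -E_e, \qquad
\varepsilon t_a = 1 - e^{-\varepsilon (t_c - t_e)}, \qquad
\int_0^{T} E(t)\, dt = 0
```

Here $E(t)$ is the glottal flow derivative. The first segment models the open phase up to the main excitation at $t_e$, and the second models the exponential return phase whose effective duration is $t_a$.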
![Page 6: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/6.jpg)
Voice source model: Other parameters of the LF-model
Open quotient: $OQ = \dfrac{t_e + t_a}{T}$
Speed quotient: $SQ = \dfrac{t_p}{t_e - t_p}$
Return quotient: $RQ = \dfrac{t_a}{T}$
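As a small worked example, the three quotients can be computed directly from the timing parameters. This is a minimal sketch that assumes all times are in seconds and measured from the opening instant (to = 0); the function name and the numeric values in the usage note are hypothetical.

```python
def lf_quotients(tp, te, ta, period):
    """Return (OQ, SQ, RQ) from LF timing parameters.

    Assumption: all times are in seconds and measured from the
    opening instant (to = 0), so tp and te are durations since opening.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    if te <= tp:
        raise ValueError("te must be later than tp")
    oq = (te + ta) / period   # open quotient
    sq = tp / (te - tp)       # speed quotient
    rq = ta / period          # return quotient
    return oq, sq, rq
```

For instance, with the hypothetical values `tp=0.004`, `te=0.005`, `ta=0.0005` and `period=0.01`, `lf_quotients` gives an open quotient of about 0.55, a speed quotient of about 4 and a return quotient of about 0.05.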
![Page 7: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/7.jpg)
Voice source model: Description of the LF-model spectrum
Linear stylization of the LF-model spectrum [Doval and d’Alessandro]
Fg: glottal spectral peak
Fc: spectral tilt
![Page 8: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/8.jpg)
Voice source model: Feature extraction
• utterances sampled at 16 kHz
• pitch-synchronous analysis (ESPS tools)
• LPCs calculated with 20 ms windows centered at the glottal epochs
• inverse filtering to estimate DGS
• pre-emphasis filter (α=0.97)
• low-pass filtering of the residual at 4 kHz
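The analysis steps above can be sketched in plain Python. This is an illustration, not the authors' ESPS-based pipeline: the Hamming window, the LPC order, the autocorrelation method and applying pre-emphasis before the LPC analysis are all assumptions, and every function name is hypothetical. The list `a` holds the coefficients of the prediction-error polynomial, which plays the role of the inverse filter written as 1/A(z) on the earlier slide. The 4 kHz low-pass step is not included.

```python
import math

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def hamming(length):
    """Hamming window (assumed; the slides do not name the window)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_around(x, center, length):
    """Cut a frame of `length` samples roughly centered at index `center`."""
    start = max(0, center - length // 2)
    return x[start:start + length]

def autocorrelation(frame, order):
    """Autocorrelation lags r[0..order] of a frame."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # updated prediction error
    return a

def inverse_filter(x, a):
    """FIR filtering with the prediction-error polynomial: e[n] = sum a[k] x[n-k]."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

def estimate_residual(x, epoch, fs=16000, order=18, frame_ms=20.0):
    """Residual of one 20 ms frame centered at a glottal epoch (sample index)."""
    length = round(frame_ms * 1e-3 * fs)
    frame = frame_around(pre_emphasis(x), epoch, length)
    window = hamming(len(frame))
    windowed = [s * w for s, w in zip(frame, window)]
    a = levinson_durbin(autocorrelation(windowed, order), order)
    return inverse_filter(frame, a)
```

Calling `estimate_residual(x, epoch)` once per glottal epoch gives a pitch-synchronous residual; how the frames are then joined into a continuous signal is left open here.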
![Page 9: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/9.jpg)
Voice source model: Estimation of te and Ee
te and Ee are estimated from the pitch-marks
![Page 10: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/10.jpg)
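Voice source model: Estimation of tc, tp and to [Gobl & Chasaide]
$t_o = \dfrac{2\,(U_{\max} - U_{\min})}{E_{\max}}$
$t_c$: instant of the minimum of $U$ ($U_{\min}$)
$t_p$: instant of the maximum of $U$ ($U_{\max}$)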
![Page 11: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/11.jpg)
Voice source model: Estimation of ta
$t_a = \dfrac{E_e}{m F_s}$
Fs: sampling frequency
m: slope of the tangent at t = te
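A minimal sketch of this estimate, assuming that Ee is taken as the positive magnitude of the negative peak at te, that m is the slope in amplitude units per sample, and that the tangent is approximated by a forward difference. The function names and the toy signal in the usage note are hypothetical.

```python
def slope_at(signal, index):
    """Forward-difference slope at `index`, in amplitude units per sample.

    Assumption: a one-sample forward difference is an adequate stand-in
    for the tangent at te; the slides do not say how m is measured.
    """
    return signal[index + 1] - signal[index]

def return_phase_duration(ee, m, fs):
    """ta = Ee / (m * Fs), in seconds."""
    if m <= 0 or fs <= 0:
        raise ValueError("m and fs must be positive")
    return ee / (m * fs)
```

With a toy glottal flow derivative `[0.0, -0.5, -1.0, -0.75, -0.5, -0.25, 0.0]`, the negative peak is at index 2, so `ee = 1.0` and `slope_at(dgs, 2)` is 0.25 per sample. At `fs = 16000` this gives a return phase of 0.25 ms.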
![Page 12: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/12.jpg)
Voice source model: Examples of the estimated parameters
[Figure: curves of the LF-parameters for 2 voiced regions of an utterance]
![Page 13: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/13.jpg)
System: General description
- Nitech-HTS 2005 system
- STRAIGHT method for analysis and synthesis
- mixed multi-band excitation with phase manipulation / pulse train
- Mel Log Spectrum Approximation (MLSA) filter
How was the LF-model integrated in the synthesizer?
![Page 14: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/14.jpg)
System: Generation of the periodic excitation (pulse signal)
• pulse centered within the frame
• multiplied by asymmetric windows
• summed with Gaussian noise
![Page 15: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/15.jpg)
System: Periodic excitation with the LF-model
• 2 LF-waveforms centered at the instant te
• multiplied by asymmetric windows
• summed with Gaussian noise
![Page 16: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/16.jpg)
System: Technical problem
Problem: the synthesis filter assumes the excitation to have a flat spectrum like the pulse train
Solution: Post-filter
Linear phase FIR filter:
-6dB/dec 1Hz ≤ f ≤ Fg (Hz)
+6dB/dec Fg < f ≤ Fc (Hz)
+12dB/dec Fc < f ≤ 16 kHz
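The target magnitude response of such a post-filter can be sketched as a piecewise-linear function of log-frequency. The sketch below uses the slopes exactly as stated on the slide (in dB per decade) and only computes the desired gain; it does not design the linear-phase FIR filter itself. The 0 dB reference at 1 Hz, the continuity of the gain at Fg and Fc, and the function name are assumptions.

```python
import math

def postfilter_gain_db(f, fg, fc, slopes=(-6.0, 6.0, 12.0), f_ref=1.0):
    """Target post-filter gain in dB at frequency f (Hz).

    slopes: dB per decade in the three bands
            [f_ref, fg], (fg, fc] and above fc, as listed on the slide.
    Assumptions: gain is 0 dB at f_ref and continuous at fg and fc.
    """
    if not (f_ref <= fg <= fc):
        raise ValueError("expected f_ref <= fg <= fc")
    if f < f_ref:
        raise ValueError("f must be at least f_ref")
    s1, s2, s3 = slopes
    # Contribution of the band up to Fg.
    gain = s1 * math.log10(min(f, fg) / f_ref)
    # Contribution of the band between Fg and Fc.
    if f > fg:
        gain += s2 * math.log10(min(f, fc) / fg)
    # Contribution of the band above Fc.
    if f > fc:
        gain += s3 * math.log10(f / fc)
    return gain
```

Sampling this function on a frequency grid gives a desired response that a frequency-sampling FIR design could approximate. The values of Fg and Fc would come from the estimated LF-parameters.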
![Page 17: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/17.jpg)
System: Effect of the post-filtering
![Page 18: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/18.jpg)
Perceptual evaluation: Generation of the stimuli
• Built the US-English voice EM001, provided by ATR for the Blizzard Challenge
• Glottal parameters were measured in 8 utterances and the mean values were calculated
• Simple excitation, without multi-band noise or phase manipulation
• Ten utterances were synthesized, using the LF-model and the pulse model
![Page 19: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/19.jpg)
Perceptual evaluation: Experiment
• Forced-choice test
• Presented via a web browser interface
• Subjects were asked if they used headphones or speakers, and if they were native speakers (U.K./U.S.)
• 18 listeners (7 native speakers of English)
• The listener panel was mainly university students and staff
Example of test speech signals (audio): Pulse, LF-model
![Page 20: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/20.jpg)
Perceptual evaluation: Results

| Listeners | LF-model excitation | Pulse train excitation |
| --- | --- | --- |
| Non-native speakers | 61% | 39% |
| Native speakers | 68.6% | 31.4% |
| Total scores and 95% CI | 64% ± 6.7% | 36% ± 6.7% |
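The total score can be cross-checked against the group scores, and the form of a confidence interval on a preference proportion can be illustrated. This is a sketch under stated assumptions: the pooling assumes every listener gave the same number of responses, the 11 non-native listeners are inferred as 18 minus 7, and the normal-approximation interval is an assumed method. The slides do not give the number of responses or the interval method, so `n` below is a placeholder and the sketch does not attempt to reproduce the ± 6.7% figure.

```python
import math

def pooled_preference(scores, counts):
    """Listener-weighted mean of group preference scores (in percent)."""
    total = sum(counts)
    return sum(s * c for s, c in zip(scores, counts)) / total

def ci_half_width(p, n, z=1.96):
    """Half-width of a normal-approximation 95% CI for a proportion p
    estimated from n independent binary responses (assumed method)."""
    if n <= 0 or not (0.0 <= p <= 1.0):
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return z * math.sqrt(p * (1.0 - p) / n)

# Native (7 listeners, 68.6%) and non-native (11 listeners, 61%) scores
# for the LF-model, pooled by number of listeners.
pooled = pooled_preference([68.6, 61.0], [7, 11])
```

The pooled value comes out at roughly 64%, which agrees with the total reported in the table.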
![Page 21: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/21.jpg)
Conclusions
• The Nitech-HTS 2005 speech synthesizer was implemented with the LF-model for the voice source
• Results showed that the LF-model can give better speech quality than the traditionally used pulse train
• Direct methods used for the estimation of the mean LF-parameters seemed to perform well
• A technical problem with the integration of the LF-model in the system was solved using a post-filter
![Page 22: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/22.jpg)
Future work
• To find better analysis/synthesis methods to use with the LF-model in HMM-based speech synthesis
• To evaluate the speech quality when using the mixed excitation with the LF-model
• To implement voice quality transformations using the LF-model
• To evaluate the parameterization methods
• To model the glottal parameters with HMMs
![Page 23: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis](https://reader035.vdocuments.us/reader035/viewer/2022062810/56815b17550346895dc8c798/html5/thumbnails/23.jpg)
Acknowledgements
This work was financially supported by the Marie Curie EdSST programme.
Thank you!