Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)
![Page 1: Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)](https://reader031.vdocuments.us/reader031/viewer/2022030214/5899c4731a28ab45548b5739/html5/thumbnails/1.jpg)
[course site]
Day 3 Lecture 5
Parametric Speech Synthesis
Antonio Bonafonte
Main TTS Technologies
Concatenative speech synthesis + Unit Selection
● Concatenate best pre-recorded speech units
● Speech data: 2-10 hours, professional speaker, carefully segmented and annotated
Concatenative
H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
Statistical Speech Synthesis
H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
Main TTS Technologies
Concatenative speech synthesis + Unit Selection
● Concatenate best pre-recorded speech units
Statistical Parametric Speech Synthesis
● Represent the speech waveform using parameters (e.g. one frame every 5 ms)
● Use a statistical generative model
● Reconstruct the waveform from the generated parameters
Hybrid Systems
● Concatenative speech synthesis
● Select best units guided by a statistical parametric system
Deep architectures … but not deep (yet)
Text to Speech: Textual features → Spectrum of speech (many coefficients)
[Diagram: TXT → designed feature extraction → ft 1, ft 2, ft 3 → regression module → s1, s2, s3 → wavegen; labelled “hand-crafted” features]
Textual features (x)
From text to phoneme (pronunciation)
● Disambiguation, pronunciation (e.g.: Jan. 26)
From phoneme to phoneme+ (with linguistic features)
Textual features (x)
● {preceding, succeeding} two phonemes
● Position of current phoneme in current syllable
● # of phonemes in {preceding, current, succeeding} syllable
● {accent, stress} of {preceding, current, succeeding} syllable
● Position of current syllable in current word
● # of {preceding, succeeding} {stressed, accented} syllables in phrase
● # of syllables {from previous, to next} {stressed, accented} syllable
● Guess at part of speech of {preceding, current, succeeding} word
● # of syllables in {preceding, current, succeeding} word
● Position of current word in current phrase
● # of {preceding, succeeding} content words in current phrase
● # of words {from previous, to next} content word
● # of syllables in {preceding, current, succeeding} phrase
H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
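A few of the textual features listed above can be sketched in code. This is only an illustration for a single word, not the front end used in the lecture: the `Syllable` class, the `"sil"` padding symbol, the feature names, and the example phoneme symbols are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List

PAD = "sil"  # assumed padding symbol for context positions outside the word


@dataclass
class Syllable:
    phonemes: List[str]
    stressed: bool


def phoneme_context_features(syllables: List[Syllable]) -> List[Dict[str, object]]:
    """Return one feature dict per phoneme of a word given as a list of syllables."""
    # Flatten the word to (phoneme, syllable index, position inside the syllable).
    flat = [(ph, s_idx, p_idx)
            for s_idx, syl in enumerate(syllables)
            for p_idx, ph in enumerate(syl.phonemes)]
    phones = [ph for ph, _, _ in flat]

    def at(i: int) -> str:
        # Neighbouring phoneme, or the padding symbol at the word boundary.
        return phones[i] if 0 <= i < len(phones) else PAD

    features = []
    for i, (ph, s_idx, p_idx) in enumerate(flat):
        syl = syllables[s_idx]
        features.append({
            "phoneme": ph,
            "prev2": at(i - 2), "prev1": at(i - 1),   # preceding two phonemes
            "next1": at(i + 1), "next2": at(i + 2),   # succeeding two phonemes
            "pos_in_syllable": p_idx + 1,             # 1-based position
            "phonemes_in_syllable": len(syl.phonemes),
            "syllable_stressed": syl.stressed,
            "syllable_pos_in_word": s_idx + 1,        # 1-based position
            "syllables_in_word": len(syllables),
        })
    return features


# Hypothetical two-syllable word with made-up phoneme symbols.
word = [Syllable(["l", "eh", "k"], True), Syllable(["ch", "er"], False)]
feats = phoneme_context_features(word)
```

A full system would also cover the word- and phrase-level items of the list (part of speech, content-word counts, phrase positions) and would encode the categorical values numerically before regression.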
Statistical Speech Synthesis
H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
Speech features (y)
● Rate: one frame every ~5 ms (200 Hz)
● Spectral features (envelope)
● Excitation features (fundamental frequency, pitch)
● Representation that allows reconstruction: vocoders (STRAIGHT, Ahocoder, ...)
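The arithmetic behind the frame rate above can be checked with a short sketch. The 5 ms shift comes from the slide; the 16 kHz sampling rate and the per-frame vector sizes are assumptions chosen only to make the numbers concrete.

```python
def frame_shift_samples(sample_rate_hz: int, shift_ms: float = 5.0) -> int:
    """Samples between the starts of two consecutive frames."""
    return int(round(sample_rate_hz * shift_ms / 1000.0))


def frames_per_second(shift_ms: float = 5.0) -> float:
    """Frame rate in Hz for a given frame shift in milliseconds."""
    return 1000.0 / shift_ms


def num_frames(num_samples: int, sample_rate_hz: int, shift_ms: float = 5.0) -> int:
    """Number of frame starts that fit in the signal (no padding)."""
    return num_samples // frame_shift_samples(sample_rate_hz, shift_ms)


SAMPLE_RATE_HZ = 16000   # assumed, not stated in the slides
SPECTRAL_COEFFS = 40     # assumed size of the spectral envelope vector
EXCITATION_VALUES = 2    # assumed: log-F0 plus a voiced/unvoiced flag

# Values the regression module has to predict for each second of speech.
values_per_second = frames_per_second() * (SPECTRAL_COEFFS + EXCITATION_VALUES)
```

Under these assumptions a 5 ms shift is 80 samples, and the model must generate several thousand parameter values per second of speech, which the vocoder then turns back into a waveform.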
Regression
[Diagram: TXT → designed feature extraction → ft 1, ft 2, ft 3 → regression module → s1, s2, s3 → wavegen]
Phoneme rate vs. frame rate
H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
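One common way to bridge the two rates is to repeat each phoneme-level feature vector for as many frames as the phoneme lasts and to append the position of the frame inside the phoneme. The sketch below assumes that scheme; the function name and the appended relative-position feature are illustrative choices, and the durations would typically come from a duration model.

```python
from typing import List, Sequence

FRAME_SHIFT_MS = 5.0  # frame shift quoted in the slides


def expand_to_frames(phoneme_feats: Sequence[Sequence[float]],
                     durations_ms: Sequence[float]) -> List[List[float]]:
    """Repeat each phoneme-level vector once per frame of that phoneme."""
    if len(phoneme_feats) != len(durations_ms):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for feats, dur in zip(phoneme_feats, durations_ms):
        # At least one frame per phoneme, even for very short durations.
        n = max(1, int(round(dur / FRAME_SHIFT_MS)))
        for k in range(n):
            rel_pos = k / n  # relative position inside the phoneme, in [0, 1)
            frames.append(list(feats) + [rel_pos])
    return frames


# Two hypothetical phonemes lasting 20 ms and 10 ms.
frame_inputs = expand_to_frames([[1.0, 0.0], [0.0, 1.0]], [20.0, 10.0])
```

The result has one input vector per 5 ms frame, so it can be paired one-to-one with the frame-level speech features.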
Duration Modeling
H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
Acoustic Modeling
Acoustic Modeling: DNN
H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
Regression using DNN (problem)
H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
Mixture density network (MDN)
H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
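As a rough sketch of the general idea, an MDN output layer predicts the parameters of a Gaussian mixture over the target instead of a single point estimate, and training minimises the negative log-likelihood of the observed value. The code below shows this for a one-dimensional target. The layout of the raw output vector (logits, then means, then log standard deviations) is an assumption made for the sketch, not the configuration shown in the slides.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax, used for the mixture weights."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mdn_params(raw: Sequence[float],
               n_mix: int) -> Tuple[List[float], List[float], List[float]]:
    """Split raw network outputs into mixture weights, means and std devs."""
    if len(raw) != 3 * n_mix:
        raise ValueError("expected 3 * n_mix raw outputs")
    weights = softmax(raw[:n_mix])                         # positive, sum to 1
    means = list(raw[n_mix:2 * n_mix])                     # unconstrained
    sigmas = [math.exp(v) for v in raw[2 * n_mix:]]        # strictly positive
    return weights, means, sigmas


def mdn_nll(raw: Sequence[float], n_mix: int, y: float) -> float:
    """Negative log-likelihood of y under the predicted 1-D Gaussian mixture."""
    weights, means, sigmas = mdn_params(raw, n_mix)
    log_terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((y - mu) / s) ** 2
        for w, mu, s in zip(weights, means, sigmas)
    ]
    # Log-sum-exp over the components keeps the computation stable.
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))


# Hypothetical output for two components centred at -1 and +1 (sigma = 0.1).
raw_out = [0.0, 0.0, -1.0, 1.0, math.log(0.1), math.log(0.1)]
nll_at_mode = mdn_nll(raw_out, 2, 1.0)
nll_at_mean = mdn_nll(raw_out, 2, 0.0)
```

In this bimodal example the value 0.0, which is the average of the two modes, is far less likely than either mode. A network trained with a squared-error loss tends toward that average, which is the kind of limitation a mixture output is meant to address.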
Recurrent Networks: LSTM
H. Zen - RTTHSS 2015 http://rtthss2015.talp.cat/
Recurrent Networks: LSTM
Boxplot of mean opinion score
SPSS: Models built using HTS
US: Ogmios unit selection system
LSTM: Duration and Acoustic LSTM models
LSTM-pf: Post-filtered
Ahocoded: Analysis/Synthesis (vocoder effect)
Natural: Human recording
Source [MSc Pascual]
Multi-speaker
Boxplot of subjective preference:
-2: multi-output system preferred
+2: speaker-dependent system preferred
Adaptation to new speaker
Speaker interpolation
Interpolation: 0 1
Interpolation: 0.25 0.75
Interpolation: 0.5 0.5
Interpolation: 0.75 0.25
Interpolation: 1 0
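The weight pairs above can be read as a convex combination of two speakers. The sketch below assumes each speaker is represented by a fixed-length vector (for example the output of a speaker-specific layer); the slides do not say which quantity is interpolated, so the vectors and the `interpolate` helper are purely illustrative.

```python
from typing import List, Sequence, Tuple


def interpolate(a: Sequence[float], b: Sequence[float],
                w_a: float, w_b: float) -> List[float]:
    """Weighted sum of two speaker vectors; the weights must add up to 1."""
    if len(a) != len(b):
        raise ValueError("speaker vectors must have the same length")
    if abs(w_a + w_b - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return [w_a * x + w_b * y for x, y in zip(a, b)]


# Weight pairs listed on the slide, taken here as (speaker A, speaker B).
WEIGHTS: List[Tuple[float, float]] = [
    (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1.0, 0.0),
]

speaker_a = [1.0, 0.0, 2.0]  # hypothetical vector for speaker A
speaker_b = [0.0, 1.0, 4.0]  # hypothetical vector for speaker B

sweep = [interpolate(speaker_a, speaker_b, w_a, w_b) for w_a, w_b in WEIGHTS]
```

The first and last entries of `sweep` reproduce the two original speakers, and the entries in between move gradually from one to the other.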
Conclusions
● The quality of DL-based SPSS is much better than that of conventional SPSS
● Concatenative hybrid systems also benefit from deep learning (e.g. Apple Siri)
● Some research adds better linguistic features to increase expressivity (e.g. sentiment analysis features)
● A vocoder is still used in most systems, which degrades quality … (but … more on WaveNet tomorrow)
References
Statistical parametric speech synthesis: from HMM to LSTM-RNN. Heiga Zen, Google. rtthss2015.talp.cat/ (slides, video, ... and references)
Deep learning applied to Speech Synthesis, 2016. Santiago Pascual, MSc Thesis, UPC. veu.talp.cat/doc/MSC_Santiago_Pascual.pdf [eusipco-2016] [ssw9]