Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
PART V: Text-to-Speech Synthesis (TTS)
TTS: What for ?
• Telephone-based applications
  – Telecommunications ($)
    • Who's calling
    • Integrated messaging (fax, email, answering machine)
    • Automatic reverse directory
    • Personal telephone attendant
  – Voice access to databases (70% of calls require very little interactivity)
    • Price lists
    • Cultural events
    • Weather report
• Multimedia
  – CD-ROMs
  – Talking books
  – Interactive games
• Man-machine communication
• Help to the disabled
  – Speech impairment
    • Artificial voice
  – Sight impairment
    • Automatic reading of electronic documents
    • Automatic reading of paper documents (with OCR)
• Fundamental research
TTS = NLP + DSP
[Figure: TEXT-TO-SPEECH SYNTHESIZER. TEXT enters NATURAL LANGUAGE PROCESSING (phonetization, intonation/duration generation), which outputs a narrow phonetic transcription (phones, intonation/duration) to DIGITAL SIGNAL PROCESSING (speech synthesis), which outputs SPEECH. Stages are marked (a), (b), (c)]
(a) To be or not to be, that is the question.
(b)
_ 210
t 40
U 55 0 173 75 173
b 80 10 160
i: 198 5 173 75 235
…
Automatic phonetization
• Dictionary look-up?
Text: To be or not to be, that is the question.
Phonemes: _ t U b i: Q r n Q t t U b i: _ D { t s D @ k w e s tS @ n _
Lexicon:
Be        b i:
Not       n Q t
Or        O r
Question  k w e s tS @ n
That      D { t
The       D @
To        t U
's        s
Automatic phonetization
• More complex than that!
Problem | Example | Level | Information
Assimilation | nasality or sonority assimilation, vocalic harmonization | word/sentence | reading style, pronunciation of neighbors
Heterophonic homographs | the, record, contrast, read, est, couvent, portions, etc. | word | part-of-speech, meaning (rare)
Schwa deletion | table rouge, je ne te le redirai pas | sentence | syntactic articulation, pronunciation of neighbors, speaking style
Phonetic liaisons | très utile, deux à deux, plat exquis | sentence | syntactic articulation
New words | proopiomelanocortin | word | spelling analogy
Proper names | your name here ... | word | morphology, analogy
[Figure: F0 contour of a French utterance with its phonetic transcription; freq. (Hz), 100 to 200, against time (s), 0 to 2]
Intonation
• Why ups and downs?
  – Stress (word level) → Accent (phrase level) (Phonetization↑)
• Modify slightly → unnatural!
Intonation
[Figure: intonation variants of "I saw him yesterday." (panels a, b, c) and of "The term 'prosody' refers to certain properties of the speech signal." (panel d)]
(a, b) Focus; (c) Finality/continuity; (d) Grouping, using phrase-level accent
Phoneme Duration
• Not constant
• Not fixed for a given phoneme
• Linked to intonation (longer on accented syllables)
[Figure: speech signal segmented into phonemes: Q l I s I z ´ d v e n tS ´ z]
Intonation/Duration
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.
Lewis Carroll, Jabberwocky
Coarticulation !!!
• Synthesis: be able to mimic coarticulation!
• (Recognition: be able to overcome it!)
[Figure: vowel chart in the formant plane, F1 (Hz), 0 to 800, against F2 (Hz), 0 to 3000; vowel labels: u, y, i, O, ø, ç, o, a, e, E]
Challenges: a summary
• Accurate automatic phonetization (≠ dictionary look-up)
• Prosody generation (i.e., intonation and phoneme durations) must be "coherent"; it is easy to produce unnatural prosody
• Synthesis of phoneme sequences with corresponding prosody
  – Coarticulation!
  – Segmental quality should be maintained after pitch and duration modification
• Engineering
  – Low design and maintenance cost
  – Low computational and memory cost
  – Easy adaptation to other languages
Intelligible – Natural – Cost effective
Contents
• Introduction
• Acoustic speech synthesis (DSP)
  – Model-based (rule-based) approach
  – Instance-based (concatenative) approach
• From text to phonemes and prosody (NLP)
  – Preprocessing
  – Morpho-syntactic analysis
  – Phonetization
  – Prosody Generation
Von Kempelen's talking machine (1791)
[Figure: drawing of the machine (J.S. Liénard, LIMSI); labeled parts: mouth, nostrils, main bellows, small bellows, 'S' pipe, 'Sh' pipe, 'S' lever, 'Sh' lever]
Homer Dudley's Voder (Bell Labs, 1936)
[Figure: Voder block diagram. A noise source (unvoiced, UV) and an oscillator (voiced, V) feed the resonance control, followed by an amplifier. Voder console keyboard: keys 1 to 10, "quiet" key, keys t-d, p-b, k-g, energy switch on the wrist bar, pitch-control pedal]
Rule-based Synthesis
1. John Holmes' formant synthesizer (1964)
Haskins Labs (1968), DecTalk (1983), InfoVox (1983-95)
[Figure: formant synthesizer block diagram; labels: glottal pulses (F0) and noise sources, two banks of resonators (B-Q), gains AV and ANV, formants F1, F2, F3 and F1, F2, additional terms FZ, FZ,n and FP,n, summed into the output x(n)]
Contents
• Introduction
• Acoustic speech synthesis (DSP)
  – Model-based (rule-based) approach
  – Instance-based (concatenative) approach
• From text to phonemes and prosody (NLP)
  – Preprocessing
  – Morpho-syntactic analysis
  – Phonetization
  – Prosody Generation
2. Diphone concatenation (1977)
[Figure: diphone-based synthesis of "dog". Target phonemes _ d o g _ with durations 50 ms, 80 ms, 160 ms, 70 ms, 50 ms and an F0 contour; the diphones _d, do, og, g_ come from the diphone database, go through prosody modification, and are concatenated with smooth joints into the output waveform]
Joe Olive's LPC synthesizer (1977)
Olive (1980), FPMs (1989)
[Figure: LPC synthesis model; labels: V/UV switch between the voiced (V, period P) and unvoiced (UV) excitations, gain σ, and the filter 1/A_p(z) defined by the p coefficients]
Christian Hamon's PSOLA (1988)
Cnet (1990), Limsi (Paris, 1992)
T. Dutoit's MBROLA (1993)
• Based on the same Poisson sum formula as PSOLA, but using edited diphones
• Similar overall quality to PSOLA
• Same computational load
• Completely automatic!
⇒ can be used to create lots of compatible synthesizers
Audio samples (French): "Ma voix..." ("My voice..."), "J'ai été conçu..." ("I was designed...")
The MBROLA project
3. Automatic unit selection (1997)
[Figure: diphone-based synthesis of _ d o g _ (durations 50 ms, 80 ms, 160 ms, 70 ms, 50 ms, with F0 contour): diphones _d, do, og, g_ from the diphone database, prosody modification, smooth joints, output waveform]
3. Automatic unit selection (1997)
[Figure: unit selection-based synthesis of _ d o g _ (durations 50 ms, 80 ms, 160 ms, 70 ms, 50 ms, with F0 contour): the units _d, do, og, g_ are taken from a VERY LARGE CORPUS, then go through prosody modification and smooth joints]
Systems: ATR (1996), Univ. Edinburgh (1997), AT&T (1998), L&H (1999), Loquendo (2001), Babel Technologies (2002)
3. Automatic unit selection (1997)
How to get the best sequence of units for a given utterance? Viterbi search
• Target cost tc(j,i)?
  How to predict which units will sound like the target? (should be perceptual)
• Concatenation cost cc(i-1,i), cc(i,i+1)?
  How to predict which sequences of units will sound naturally connected? (should be perceptual)
[Figure: unit i is linked to target j by the target cost tc(j,i), and to its neighbors, unit i-1 and unit i+1, by the concatenation costs cc(i-1,i) and cc(i,i+1)]
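The Viterbi search over target and concatenation costs can be sketched as follows. This is a minimal illustration, not the method of any system named on these slides: the `Unit` record, the cost functions (duration mismatch in ms, a flat penalty of 100 for non-consecutive units) and the toy corpus are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical unit record: phone label, duration (ms), source sentence, position in it.
Unit = namedtuple("Unit", "phone dur src pos")

def target_cost(target_dur, unit):
    # Assumed cost: duration mismatch in ms (a real cost should be perceptual).
    return abs(target_dur - unit.dur)

def concat_cost(left, right):
    # Zero when the two units were consecutive in the corpus, flat penalty otherwise.
    return 0 if left.src == right.src and right.pos == left.pos + 1 else 100

def select_units(targets, candidates):
    """Viterbi search: pick one candidate per target, minimizing the total cost."""
    # best[i][k] = (cumulative cost, back-pointer) for candidate k of target i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total, path = best[-1][k][0], []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    return path[::-1], total

targets = [40, 55, 80]  # target durations (ms) for t, U, b
candidates = [
    [Unit("t", 50, "to bear", 0), Unit("t", 40, "teletubbies", 0)],
    [Unit("U", 85, "to bear", 1), Unit("U", 55, "other", 3)],
    [Unit("b", 90, "to bear", 2), Unit("b", 80, "teletubbies", 4)],
]
path, total = select_units(targets, candidates)
print([u.src for u in path], total)
```

With these made-up numbers the search prefers the three consecutive units from "to bear", even though their durations fit the target less well, because they incur no concatenation cost.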
3. Automatic unit selection
Target cost tc(j,i): between target j and unit i of the very large corpus
TARGET j
  sent : "To be …"
  phonet : _ t U b i: …
  stress : ^ …
  tone : l H …
  dur : 210 40 55 80 198 …
  f0 : …
Unit i
  sent : "… to bear."
  phonet : t U b E@ …
  stress : ^ …
  tone : l L …
  dur : 150 50 85 90 150 …
  f0 : …
  Formants: …

3. Automatic unit selection
Concatenation cost cc(i,i+1): between unit i and unit i+1 of the very large corpus
Unit i
  sent : "… to bear."
  phonet : t U b E@ …
  stress : ^ …
  tone : l L …
  dur : 150 50 85 90 150 …
  f0 : …
  Formants: …
Unit i+1
  sent : "… teletubbies …"
  phonet : t @ b i: s …
  stress : ^ …
  tone : L l …
  dur : 80 95 90 130 …
  f0 : …
  Formants: …

3. Automatic unit selection
Concatenation cost cc(i-1,i): = 0 in case of successive units
Unit i-1 and unit i both come from the same corpus sentence:
  sent : "… to bear."
  phonet : t U b E@ …
  stress : ^ …
  tone : l L …
  dur : 150 50 85 90 150 …
  f0 : …
  Formants: …
Contents
• Introduction
• Acoustic speech synthesis (DSP)
  – Model-based (rule-based) approach
  – Instance-based (concatenative) approach
• From text to phonemes and prosody (NLP)
  – Preprocessing
  – Morpho-syntactic analysis
  – Phonetization
  – Prosody Generation
The NLP module
[Figure: NLP module architecture. Text enters the text analyzer (pre-processor, morphological analyzer, contextual analyzer, syntactic-prosodic parser), followed by the letter-to-sound module and the prosody generator, whose output goes to the DSP block. An internal data structure is labeled "MLDS or FSs"]
From text to phonetics
[Figure: the processing steps, numbered 1 to 6]
1. Pre-processing
• Abbreviations
  'tgl.' = 'täglich', 'tägliche', 'täglichem', 'täglichen', 'täglicher', 'tägliches'
  'Dr. Jones lives at the corner of Jones Dr. and St. James St.'
• Recognizing acronyms
  'L.T.S.', 'UFO', 'FPMs', ...
• Processing numbers
  '3.14', '2.16 pm', '13:26', '08.11.94', 'the 16th'
• Simple regular grammars do most of the job (Lex, FSTs)
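As a rough illustration of how far simple regular expressions go on the examples above, here is a minimal sketch. The disambiguation heuristic (an abbreviation before a capitalized word is a title, otherwise a street suffix), the pattern list, and the function names are assumptions made for this example; a real pre-processor would need far more rules.

```python
import re

def expand_abbreviations(text):
    """Expand 'Dr.' and 'St.' using the case of the following word (assumed heuristic)."""
    # Before a capitalized word: read as a title ('Doctor', 'Saint').
    out = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    out = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", out)
    # Anywhere else: read as a street suffix ('Drive', 'Street').
    out = re.sub(r"\bDr\.", "Drive", out)
    out = re.sub(r"\bSt\.", "Street", out)
    # A sentence-final abbreviation dot also ends the sentence: restore it.
    if text.endswith(".") and not out.endswith("."):
        out += "."
    return out

# Hypothetical token classes for the number examples on the slide; order matters.
NUMBER_PATTERNS = [
    ("date", re.compile(r"\d{2}\.\d{2}\.\d{2}")),
    ("time", re.compile(r"\d{1,2}:\d{2}|\d{1,2}\.\d{2} ?[ap]m")),
    ("ordinal", re.compile(r"\d+(st|nd|rd|th)")),
    ("decimal", re.compile(r"\d+\.\d+")),
]

def classify_number(token):
    """Return the first class whose pattern matches the whole token, else None."""
    for label, pattern in NUMBER_PATTERNS:
        if pattern.fullmatch(token):
            return label
    return None

print(expand_abbreviations("Dr. Jones lives at the corner of Jones Dr. and St. James St."))
print([classify_number(t) for t in ["3.14", "2.16 pm", "13:26", "08.11.94", "16th"]])
```

Note that '2.16 pm' and '3.14' share the same digit pattern; only the trailing 'pm' separates them, which is why context has to be part of the rule.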
St{ ↔ st{
2. Morphological analysis
• Increasingly: use of trained graphotactic systems (e.g., TNT, http://www.coli.uni-sb.de/~thorsten/tnt/)
• Or even brute force: an inflected dictionary
3. Contextual analysis
[Figure: lattice of candidate part-of-speech tags (N_sg, N_pl, Verb, V_3sg, Prep, Conj, Adj, Infinitive marker) over the sentence "Dogs like to bark"]
3. Contextual analysis: n-grams
Sentence $W = (w_1, w_2, \ldots, w_N)$
All possible tag sequences $T = (t_1, t_2, \ldots, t_N)$
⇒ Best sequence of tags: $\hat{T} = \arg\max_T P(T \mid W)$

$$\hat{T} = \arg\max_T \frac{P(W \mid T)\,P(T)}{P(W)} \;\;\text{(Bayes)} \;= \arg\max_T P(W \mid T)\,P(T)$$

$$P(W \mid T) = P(w_1, w_2, \ldots, w_N \mid t_1, t_2, \ldots, t_N) = P(w_1 \mid t_1)\,P(w_2 \mid w_1, t_1, t_2)\,P(w_3 \mid w_1, w_2, t_1, t_2, t_3) \cdots = \prod_{i=1}^{N} P(w_i \mid t_i) \quad \text{(strong hypothesis 1)}$$

$$P(T) = P(t_1, t_2, \ldots, t_N) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2, t_1) \cdots = \prod_{i=1}^{N} P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_{i-n}) \quad \text{(strong hypothesis 2)}$$
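3. Contextual analysis: n-grams
Example: « Dogs like to bark », using bi-grams (n=2)
… (all possible paths) …

$$P(\text{N\_pl}, \text{verb}, \text{Inf\_marker}, \text{verb} \mid \text{dogs}, \text{like}, \text{to}, \text{bark}) \propto \underbrace{P(\text{dogs} \mid \text{N\_pl})\,P(\text{like} \mid \text{verb})\,P(\text{to} \mid \text{Inf\_marker})\,P(\text{bark} \mid \text{verb})}_{\prod_{i=1}^{N} P(w_i \mid t_i)} \times \underbrace{P(\text{N\_pl} \mid \_)\,P(\text{verb} \mid \text{N\_pl})\,P(\text{Inf\_marker} \mid \text{verb})\,P(\text{verb} \mid \text{Inf\_marker})}_{\prod_{i=1}^{N} P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_{i-n})}$$

(The constant factor $1/P(W)$ is dropped, since it does not depend on the tag sequence.)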
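The bi-gram example can be turned into a small Viterbi tagger. This is a sketch under stated assumptions: every probability in `EMIT` and `TRANS` below is invented for illustration (the slides give none), and the reduced tag set per word is also an assumption.

```python
import math

# Hypothetical lexical probabilities P(word | tag); the numbers are made up.
EMIT = {
    "dogs": {"N_pl": 0.01, "V_3sg": 0.001},
    "like": {"Verb": 0.02, "Prep": 0.05},
    "to": {"Prep": 0.1, "Inf_marker": 0.9},
    "bark": {"N_sg": 0.002, "Verb": 0.003},
}
# Hypothetical bi-gram probabilities P(tag | previous tag); "_" marks the sentence start.
TRANS = {
    ("_", "N_pl"): 0.2, ("_", "V_3sg"): 0.01,
    ("N_pl", "Verb"): 0.3, ("N_pl", "Prep"): 0.1,
    ("V_3sg", "Verb"): 0.01, ("V_3sg", "Prep"): 0.2,
    ("Verb", "Prep"): 0.2, ("Verb", "Inf_marker"): 0.1,
    ("Prep", "Prep"): 0.01, ("Prep", "Inf_marker"): 0.01,
    ("Prep", "N_sg"): 0.3, ("Prep", "Verb"): 0.01,
    ("Inf_marker", "Verb"): 0.9, ("Inf_marker", "N_sg"): 0.01,
}

def log_prob(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi_tags(words, emit, trans, start="_"):
    """Return the tag sequence maximizing prod P(w_i|t_i) * P(t_i|t_{i-1})."""
    # scores[tag] = (log-probability of the best path ending in tag, that path)
    scores = {start: (0.0, [])}
    for word in words:
        new_scores = {}
        for tag, p_emit in emit[word].items():
            candidates = [
                (s + log_prob(trans.get((prev, tag), 0.0)) + log_prob(p_emit), path + [tag])
                for prev, (s, path) in scores.items()
            ]
            new_scores[tag] = max(candidates, key=lambda c: c[0])
        scores = new_scores
    return max(scores.values(), key=lambda c: c[0])[1]

print(viterbi_tags("dogs like to bark".split(), EMIT, TRANS))
```

Log-probabilities are used so that long sentences do not underflow; only the best path into each tag is kept at every word, which is what makes the search linear in sentence length.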
4. Syntactic-Prosodic Phrasing
• Chinks 'n chunks
  a prosodic phrase = a sequence of chinks (≈ function words) followed by a sequence of chunks (≈ content words)
• Example (each line is one prosodic phrase: chinks, then chunks):
  I asked them
  if they were going home
  to Idaho
  and they said yes
  and anticipated one more stop
  before getting home
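The chinks 'n chunks rule amounts to opening a new prosodic phrase whenever a chink follows a chunk. A minimal sketch, assuming a hand-picked chink list for this one sentence (the `CHINKS` set is hypothetical; note that the object pronoun "them" has to count as a chunk for the slide's segmentation to come out):

```python
# Hypothetical chink (function-word) list, just large enough for the example sentence.
CHINKS = {"i", "if", "they", "were", "to", "and", "before"}

def chink_chunk_phrases(words, chinks=CHINKS):
    """Split words into prosodic phrases: a run of chinks followed by a run of chunks."""
    phrases, current, seen_chunk = [], [], False
    for word in words:
        is_chink = word.lower() in chinks
        # A chink right after a chunk closes the current phrase and opens a new one.
        if is_chink and seen_chunk:
            phrases.append(" ".join(current))
            current, seen_chunk = [], False
        current.append(word)
        if not is_chink:
            seen_chunk = True
    if current:
        phrases.append(" ".join(current))
    return phrases

sentence = ("I asked them if they were going home to Idaho and they said yes "
            "and anticipated one more stop before getting home")
for phrase in chink_chunk_phrases(sentence.split()):
    print(phrase)
```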
4. Syntactic-Prosodic Phrasing
• CART tree
[Figure: binary decision tree with Yes/No branches. Each node tests one feature (st, tr, j2, j3, j4, fl, fr, et, type, final) and carries a count ratio, e.g. st 2974/3379, j2 1617/1838, j3 1866/2261, tr 249/423, fl 491/613, …]
st = time to end of sentence
j3 = tag of word on the right
j2 = tag of word on the left
tr = utterance rate (in words/second)
…
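Classification and Regression trees (CARTs)
[Figure: a row of 14 colored shapes, numbered 1 to 14]
Predict Color(n) from SHapes, Sizes (n, n-1, n+1, n-2, n+2, …)?
SOL: if SHape(n-1) = SHape(n) ⇒ Color(n) = White
     else Color(n) = Black
This can be seen as a classification problem (C = color, S = size, SH = shape):
n  | C(n) | S(n) | SH(n) | S(n-1) | SH(n-1) | S(n+1) | SH(n+1) | …
1  | B    | S    | S     | -      | -       | S      | C       | …
2  | B    | S    | C     | S      | S       | B      | C       | …
3  | W    | B    | C     | S      | C       | S      | C       | …
…  | …    | …    | …     | …      | …       | …      | …       | …
14 | W    | S    | C     | B      | C       | B      | T       | …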
Classification and Regression trees (CARTs)
[Figure: an « impure » set is split by a Yes/No question Q, chosen as the question which splits it into the best « purified » sets; repeated splitting ends in « pure » sets]
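Classification and Regression trees (CARTs)
« Purity » of a set? = « Entropy » (bits)

$$\text{Entropy} = -P(\text{Black})\,\log_2 P(\text{Black}) - P(\text{White})\,\log_2 P(\text{White})$$

Before the split (half Black, half White): Entropy $= -(1/2 \times -1) - (1/2 \times -1) = 1$ bit
After the split, each subset is pure: Entropy $= -(1 \times 0) - (0 \times \ldots) = 0$ bit
Total entropy after split $= 0 + 0 = 0$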
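The entropy criterion can be checked numerically on the four fully listed rows of the shapes table (n = 1, 2, 3, 14). This is a sketch, not a full CART trainer: it only scores two candidate questions, and it weights each subset's entropy by its size, which is the usual convention and an assumption relative to the slide's plain sum.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, question):
    """Size-weighted entropy of the two subsets produced by a yes/no question."""
    yes = [lab for row, lab in zip(rows, labels) if question(row)]
    no = [lab for row, lab in zip(rows, labels) if not question(row)]
    n = len(labels)
    return sum(len(part) / n * entropy(part) for part in (yes, no) if part)

# Rows n = 1, 2, 3, 14 of the table: size, shape, and shape of the previous item.
rows = [
    {"size": "S", "shape": "S", "prev_shape": None},
    {"size": "S", "shape": "C", "prev_shape": "S"},
    {"size": "B", "shape": "C", "prev_shape": "C"},
    {"size": "S", "shape": "C", "prev_shape": "C"},
]
colors = ["B", "B", "W", "W"]

def same_shape(row):   # the question behind the slide's solution
    return row["prev_shape"] == row["shape"]

def is_small(row):     # a deliberately poor question, for contrast
    return row["size"] == "S"

print(entropy(colors))                            # impure set: 1 bit
print(split_entropy(rows, colors, same_shape))    # pure subsets: 0 bit
print(split_entropy(rows, colors, is_small))      # still impure
```

A tree builder would evaluate every available question this way at each node and keep the one with the lowest entropy after the split.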
5. Automatic Phonetization
• Rule-based, usually formalized as in generative phonology:
  a → [b] / l _ r
  (letter a is pronounced [b] when preceded by l and followed by r)
• Example: s → [s] or [z] (French)
  s → [z] / [éanti|hanti] _ [<V>]
  s → [s] / [anti|contre|impré|prime|tourne|ultra|psycho|télé] _ []
  s → [s] / [vrai] _ [em]
  s → [s] / [_a|para|sinu] _ [e|o|y]
  s → [z] / [tran] _ [a|h|i]
  s → [z] / [<V>] _ [<V>]
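Ordered context rules of this kind map directly onto (left context, right context, phoneme) triples tried in sequence. The sketch below encodes the six rules for French s; the vowel set standing for <V>, the reading of "_a" as word-initial a, and the default phoneme [s] when no rule fires are all assumptions, since the slide does not spell them out.

```python
import re

V = "aeiouyàâéèêëîïôùûü"  # assumed vowel set for <V>

# (left-context regex, right-context regex, phoneme); the first matching rule wins.
S_RULES = [
    (r"(éanti|hanti)$", rf"^[{V}]", "z"),
    (r"(anti|contre|impré|prime|tourne|ultra|psycho|télé)$", r"", "s"),
    (r"vrai$", r"^em", "s"),
    (r"(^a|para|sinu)$", r"^[eoy]", "s"),   # "_a" read as word-initial a
    (r"tran$", r"^[ahi]", "z"),
    (rf"[{V}]$", rf"^[{V}]", "z"),
]

def phonetize_s(word, i, default="s"):
    """Phoneme for the letter s at index i of word, using the ordered rules."""
    if word[i] != "s":
        raise ValueError("index i must point at the letter s")
    left, right = word[:i], word[i + 1:]
    for left_pat, right_pat, phone in S_RULES:
        if re.search(left_pat, left) and re.search(right_pat, right):
            return phone
    return default  # assumed fallback when no rule applies

for word in ["rose", "antisocial", "transit", "parasol", "vraisemblable", "asocial"]:
    i = word.index("s")
    print(word, phonetize_s(word, i))
```

Rule order carries the exceptions: "antisocial" hits the prefix rule before the general vowel-s-vowel rule can voice the s.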
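5. Automatic Phonetization
• Decision trees for predicting phonemes + stress
[Figure: decision tree stored as a TRIE. Successive levels branch on the focus character, the first character on the right, the first character on the left, the second character on the right, the second character on the left (+ part-of-speech!); example branch labels: a b c d e / u v w / h i / e i o / _ e s, ending in the leaf [eI]]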
6. Prosody generation: patterns
French example, one intonation pattern per phrase:
- Si ces oeufs (minor continuation)
  étaient frais (major continuation)
  j'en prendrais (finality)
  Qui les vend ? (wh-question)
  C'est bien toi, (question)
  ma jolie ? (echo)
- Evidemment, (implication)
  Monsieur. (parenthesis)
- Allons donc ! (exclamation)
  Prouve-le-moi. (order)
(English: "If these eggs were fresh, I would take some. Who sells them? Is it really you, my pretty one?" "Obviously, sir." "Come on! Prove it to me.")
6. Prosody generation: tones
[Figure: utterance annotated with the tone sequence L L H L L H L L H L L L-]
NB: more theories than researchers …
6. Prosody generation: tones
• From text to tones
       ce personnage grossier, te dérange-t-il   (French: "this rude character, does he bother you")
  WS   . . . o . o . . o .
  SG   (. . . -) (. -) (. . . -)
  IG 1 (. . . /LL) (. HH) (. . . H/H)
  IG 2 (. . . - . HH) (. . . H/H)
WS = word stress = lexical stress (see Phonetization)
SG = stress group (only one stressed syllable)
IG = intonation group (see Synt.-Pros. Phrasing)
6. Prosody generation: tones
• From tones to F0 (Hz) by rule:
[Figure: F0 contour drawn over the phonetic transcription of a French utterance, with the tone sequence /LL HH H L- l-; labels: ceiling, low range, floor, slope; tone levels H+, H, L, L-]
6. Prosody generation: tones
• From tones to F0 (Hz), corpus-based
• Large speech database, with known intonation groups, F0 and tones
• For each target intonation group:
  – find a list of similar intonation groups (in terms of tones, number of syllables, position in sentence, etc.) in the database
• Select the sequence of intonation groups in the database which:
  – best represents the target groups
  – AND minimizes intonative discontinuities
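6. Prosody generation: tones
(H, L, H+, L- = stressed syllables; h, l = unstressed syllables)
[Figure: for the TARGET tone sequence (l l l H) (l l H) (l l L-), a VITERBI search selects intonation groups from the very large corpus; two resulting F0 (Hz) contours are marked "Bad" and "Better"]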
Conclusion
• Introduction
• Acoustic speech synthesis (DSP)
  – Model-based (rule-based) approach
  – Instance-based (concatenative) approach
• From text to phonemes and prosody (NLP)
  – Preprocessing
  – Morpho-syntactic analysis
  – Phonetization
  – Prosody Generation
• Towards...
Towards corpus-based techniques
– For automatic phonetization (L&H, ENST, Univ. Edinburgh, FPMs)
– For automatic generation of intonation and phoneme duration (AT&T, FPMs, Univ. Aix, Univ. Edinburgh)
– For automatic selection of units for concatenative synthesis (ATR, Univ. Edinburgh, AT&T, FPMs?)
1995-?: The database years