synthesis (tts) text-to-speech part v · sent : “… teletubbies …“ ... – morpho-syntactic...

Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit

PART V: Text-to-Speech Synthesis (TTS)

TTS: What for ?

• Telephone-based applications
  – Telecommunications ($)
    • Who's calling
    • Integrated messaging (fax, email, answering machine)
    • Automatic reverse directory
    • Personal telephone attendant
  – Voice access to databases (70% of calls require very little interactivity)
    • Price lists
    • Cultural events
    • Weather report

• Multimedia
  – CD-ROMs
  – Talking books
  – Interactive games

• Man-machine communication

TTS: What for ?

• Help to the disabled
  – Speech impairment
    • Artificial voice
  – Sight impairment
    • Automatic reading of electronic documents
    • Automatic reading of paper documents (with OCR)

TTS: What for ?

• Fundamental research

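TTS = NLP + DSP

[Figure: the TEXT-TO-SPEECH SYNTHESIZER as two blocks. NATURAL LANGUAGE PROCESSING (phonetization; intonation/duration generation) turns TEXT into a narrow phonetic transcription (phones, intonation/duration); DIGITAL SIGNAL PROCESSING (speech synthesis) turns that transcription into SPEECH. Stages are marked (a), (b), (c).]

(a) To be or not to be, that is the question.

(b) One phoneme per line, followed by its duration and pitch targets:
_ 210
t 40
U 55 0 173 75 173
b 80 10 160
i: 198 5 173 75 235
…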

Automatic phonetization
• Dictionary look-up?

To be or not to be, that is the question.

_ t U b i: Q r n Q t t U b i: _ D { t s D @ k w e s tS @ n _

Be        b i:
Not       n Q t
Or        O r
Question  k w e s tS @ n
That      D { t
The       D @
To        t U
's        s
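A minimal sketch of dictionary look-up phonetization, assuming a toy lexicon copied from the entries above. The function name `phonetize` and the `<?word>` marker for unknown words are hypothetical choices made for this illustration.

```python
# Toy lexicon copied from the slide; a real lexicon is far larger.
LEXICON = {
    "be": "b i:",
    "not": "n Q t",
    "or": "O r",
    "question": "k w e s tS @ n",
    "that": "D { t",
    "the": "D @",
    "to": "t U",
    "'s": "s",
}

def phonetize(words, lexicon=LEXICON):
    """Concatenate lexicon transcriptions; unknown words are flagged, not guessed."""
    return " ".join(lexicon.get(w.lower(), f"<?{w}>") for w in words)

print(phonetize(["To", "be", "or", "not", "to", "be"]))
print(phonetize(["proopiomelanocortin"]))  # not in the lexicon, so it is flagged
```

The second call shows why look-up alone is not enough: any word missing from the lexicon gets no transcription at all.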


Automatic phonetization
• More complex than that!

Each problem is listed with an example, the level at which it arises, and the information needed to solve it.

• Assimilation
  – Example: nasality or sonority assimilation, vocalic harmonization
  – Level: word/sentence
  – Information: reading style, pronunciation of neighbors
• Heterophonic homographs
  – Example: the, record, contrast, read, est, couvent, portions, etc.
  – Level: word
  – Information: part-of-speech, meaning (rare)
• Schwa deletion
  – Example: table rouge, je ne te le redirai pas
  – Level: sentence
  – Information: syntactic articulation, pronunciation of neighbors, speaking style
• Phonetic liaisons
  – Example: très utile, deux à deux, plat exquis
  – Level: sentence
  – Information: syntactic articulation
• New words
  – Example: proopiomelanocortin
  – Level: word
  – Information: spelling analogy
• Proper names
  – Example: your name here ...
  – Level: word
  – Information: morphology, analogy

[Figure: F0 contour of a French utterance, freq. (Hz) from 100 to 200 against time (s) from 0 to 2, with the phonetic labels along the time axis.]

Intonation

• Why ups and downs?
  – Stress (word level) → Accent (phrase level)
    (Phonetization ↑)

• Modify slightly → unnatural!

Intonation

I saw him yesterday.

I saw him yesterday.







a. b. c.

The term 'prosody' refers to certain properties of the speech signal.

d.

(a,b) Focus (c) Finality/continuity (d) Grouping, using phrase-level accent

Phoneme Duration

• Not constant
• Not fixed for a given phoneme
• Linked to intonation (longer on accented syllables)

Q l I s I z ´ d v e n tS ´ z


Intonation/Duration

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

Lewis Carroll, Jabberwocky

Coarticulation !!!

• Synthesis: be able to mimic coarticulation!
• (Recognition: be able to overcome it!)

[Figure: vowel chart in the F1 (Hz) / F2 (Hz) plane, with F1 from 0 to 800 Hz and F2 from 0 to 3000 Hz; the vowels plotted include u, y, i, o, ø, O, e, E and a.]

Challenges : a summary

• Accurate automatic phonetization (≠ dictionary look-up)
• Prosody generation (i.e., intonation and phoneme durations) must be “coherent”; easy to produce unnatural prosody
• Synthesis of phoneme sequences with corresponding prosody
  – Coarticulation!
  – Segmental quality should be maintained after pitch and duration modification
• Engineering
  – Low design and maintenance cost
  – Low computational and memory cost
  – Easy adaptation to other languages

Intelligible – Natural – Cost effective

Contents

• Introduction

• Acoustic speech synthesis (DSP)
  – Model-based (rule-based) approach
  – Instance-based (concatenative) approach

• From text to phonemes and prosody (NLP)
  – Preprocessing
  – Morpho-syntactic analysis
  – Phonetization
  – Prosody Generation


Von Kempelen’s talking machine (1791)

[Figure (J.S. Liénard, LIMSI): the machine, with labels for the mouth, nostrils, main bellows, small bellows, 'S' pipe, 'Sh' pipe, 'S' lever and 'Sh' lever.]

Homer Dudley’s Voder (Bell Labs, 1936)

[Figure: block diagram of the Voder. A noise source (UV) and an oscillator (V) feed the resonance control, followed by an amplifier. The Voder console keyboard has keys 1 to 10, a "Quiet" key, stop keys t-d, p-b and k-g, an energy switch on the wrist bar, and a pitch-control pedal.]

Rule-based Synthesis

1. John Holmes’ formant synthesizer (1964)

Haskins Labs (1968), DecTalk (1983), InfoVox (1983-95)

[Figure: block diagram of the formant synthesizer. Glottal pulses (F0, gain AV) and noise (gain ANV) each drive a set of resonators (B-Q); labels shown are F1, F2, F3 on one branch, F1, F2 on the other, plus FZ, FZ,n and FP,n; the branch outputs are summed into x(n).]


Contents

• Introduction
• Acoustic speech synthesis (DSP)
  – Model-based (rule-based) approach
  – Instance-based (concatenative) approach


2. Diphone concatenation (1977)

[Figure: diphone-based synthesis. The phoneme string "_ d o g _", with durations 50 ms, 80 ms, 160 ms, 70 ms, 50 ms and an F0 contour, is mapped to the diphones _d, do, og, g_ taken from the diphone database; prosody modification and smooth joints produce the output waveform.]
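The mapping from the phoneme string "_ d o g _" to the diphone sequence _d, do, og, g_ can be sketched as follows. This is only an illustration of the unit inventory look-up; the function name `to_diphones` is hypothetical, and the database access, prosody modification and joint smoothing are not shown.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor: n phonemes give n-1 diphones."""
    return [a + b for a, b in zip(phonemes, phonemes[1:])]

print(to_diphones(["_", "d", "o", "g", "_"]))  # ['_d', 'do', 'og', 'g_']
```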

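Joe Olive’s LPC synthesizer (1977)

Olive (1980), FPMs (1989)

[Figure: LPC synthesis model. A V/UV switch selects the voiced (V) excitation with period P or the unvoiced (UV) excitation; the excitation is scaled by the gain σ and filtered by 1/A_p(z), defined by p coefficients.]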


Christian Hamon’s PSOLA (1988)

Cnet (1990), Limsi (Paris, 1992)

T. Dutoit’s MBROLA (1993)

• Based on the same Poisson sum formula as PSOLA, but using edited diphones
• Overall quality similar to PSOLA
• Same computational load
• Completely automatic!

⇒ can be used to create lots of compatible synthesizers

Audio samples: "Ma voix..." ("My voice..."), "J’ai été conçu..." ("I was designed...")

The MBROLA project

3. Automatic unit selection (1997)

[Figure: diphone-based synthesis compared with unit selection-based synthesis. In both, the phoneme string "_ d o g _" and its F0 contour are rendered with the units _d, do, og, g_ and smooth joints. Diphone-based synthesis takes the units from a diphone database; unit selection-based synthesis takes them from a VERY LARGE CORPUS.]

(ATR, 1996) (Univ. Edinburgh, 1997) (AT&T, 1998) (L&H, 1999) (Loquendo, 2001) (Babel Technologies, 2002)


How to get the best sequence of units for a given utterance? Viterbi search

• Target cost? How to predict which units will sound as they would when naturally connected? (should be perceptual)
• Concatenation cost? How to predict which sequences of units will sound naturally connected? (should be perceptual)

[Figure: Unit i is linked to Target j by the target cost tc(j,i), and to its neighbors Unit i-1 and Unit i+1 by the concatenation costs cc(i-1,i) and cc(i,i+1).]
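A minimal sketch of the Viterbi search over candidate units, assuming integer toy costs. The unit representation (diphone, corpus position, F0), the two cost functions and every number below are assumptions made for this illustration; real target and concatenation costs are meant to be perceptual.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi search; candidates[j] lists the corpus units usable for target j.
    Returns (total cost, best unit sequence)."""
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (target_cost(targets[0], u), [u]) for u in candidates[0]}
    for j in range(1, len(targets)):
        new_best = {}
        for u in candidates[j]:
            prev, (cost, path) = min(
                best.items(), key=lambda kv: kv[1][0] + concat_cost(kv[0], u)
            )
            total = cost + concat_cost(prev, u) + target_cost(targets[j], u)
            new_best[u] = (total, path + [u])
        best = new_best
    return min(best.values(), key=lambda v: v[0])

# Hypothetical toy data: targets are (diphone, wanted F0 in Hz),
# units are (diphone, position in corpus, F0 in Hz).
targets = [("_d", 100), ("do", 110), ("og", 105)]
candidates = [
    [("_d", 10, 100), ("_d", 40, 120)],
    [("do", 11, 125), ("do", 70, 110)],
    [("og", 12, 120), ("og", 71, 104)],
]

def target_cost(t, u):
    return abs(t[1] - u[2])  # F0 mismatch only (assumption)

def concat_cost(a, b):
    # Zero for units that are successive in the corpus, as on the slides.
    return 0 if b[1] == a[1] + 1 else 10 + abs(a[2] - b[2])

cost, path = select_units(targets, candidates, target_cost, concat_cost)
print(cost, [u[1] for u in path])  # 21 [10, 70, 71]
```

In this toy case the search accepts one costly joint (position 10 to 70) because the two following units match the target F0 and are successive in the corpus.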

3. Automatic unit selection

Target cost tc(j,i): compares TARGET j with a candidate Unit i from the very large corpus.

TARGET j
sent : “To be …”
phonet : _ t U b i: …
stress : ^ …
tone : l H …
dur : 210 40 55 80 198 …
f0 : …

Unit i
sent : “… to bear.”
phonet : t U b E@ …
stress : ^ …
tone : l L …
dur : 150 50 85 90 150 …
f0 :
Formants:

Concatenation cost cc(i,i+1): compares Unit i with Unit i+1 from the very large corpus.

Unit i+1
sent : “… teletubbies …”
phonet : t @ b i: s …
stress : ^ …
tone : L l …
dur : 80 95 90 130 …
f0 :
Formants:

Concatenation cost cc(i-1,i): compares Unit i-1 with Unit i; = 0 in case of successive units.

Contents

• Introduction

• Acoustic speech synthesis (DSP)
  – Model-based (rule-based) approach
  – Instance-based (concatenative) approach

• From text to phonemes and prosody (NLP)
  – Preprocessing
  – Morpho-syntactic analysis
  – Phonetization
  – Prosody Generation

The NLP module

[Figure: Text enters the text analyzer, made of (1) the pre-processor, (2) the morphological analyzer, (3) the contextual analyzer and (4) the syntactic-prosodic parser; it is followed by (5) the letter-to-sound module and (6) the prosody generator, whose output goes to the DSP block. The modules share an MLDS or FSs.]

From text to phonetics

1. Pre-processing

Handling abbreviations
'tgl.' = 'täglich', 'tägliche', 'täglichem', 'täglichen', 'täglicher', 'tägliches'
'Dr. Jones lives at the corner of Jones Dr. and St. James St.'

Recognizing acronyms
'L.T.S.', 'UFO', FPMs, ...

Processing numbers
'3.14', '2.16 pm', '13:26', '08.11.94', 'the 16th'

Simple regular grammars do most of the job (Lex, FSTs)
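As a rough illustration of what simple regular grammars can do here, the sketch below classifies tokens with regular expressions. The class names and patterns are assumptions made for this example, not the patterns of any particular system, and they only detect token types; expanding them into words is a separate step.

```python
import re

# Hypothetical token classes, ordered from most to least specific.
PATTERNS = [
    ("date",    re.compile(r"\d{2}\.\d{2}\.\d{2}")),
    ("time",    re.compile(r"\d{1,2}:\d{2}")),
    ("decimal", re.compile(r"\d+\.\d+")),
    ("ordinal", re.compile(r"\d+(?:st|nd|rd|th)")),
    ("acronym", re.compile(r"(?:[A-Z]\.){2,}|[A-Z]{2,}")),
    ("abbrev",  re.compile(r"[A-Za-z]+\.")),
]

def classify(token):
    """Return the first class whose pattern matches the whole token."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "word"

for tok in ["3.14", "13:26", "08.11.94", "16th", "L.T.S.", "UFO", "Dr.", "Jones"]:
    print(tok, classify(tok))
```

Ambiguity remains after classification: '2.16' is tagged as a decimal although '2.16 pm' is a time, and 'Dr.' or 'St.' still has to be expanded according to its context.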


St{ ↔ st{

2. Morphological analysis

Increasingly: use of trained graphotactic systems (e.g., TNT, http://www.coli.uni-sb.de/~thorsten/tnt/)

Or even brute force: an inflected dictionary

3. Contextual analysis

[Figure: the candidate part-of-speech tags of each word of "Dogs like to bark".]

Dogs: N_pl, V_3sg
like: N_sg, Verb, Prep, Conj, Adj
to: Prep, Infinitive marker
bark: N_sg, Verb

3. Contextual analysis :n-grams

Sentence $W = (w_1, w_2, \ldots, w_N)$

All possible tag sequences $T = (t_1, t_2, \ldots, t_N)$

→ Best sequence of tags:

$$\hat{T} = \arg\max_{T} P(T \mid W) = \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)} \ \text{(Bayes)} = \arg\max_{T} P(W \mid T)\,P(T)$$

$$P(W \mid T) = P(w_1, w_2, \ldots, w_N \mid t_1, t_2, \ldots, t_N) = P(w_1 \mid t_1)\,P(w_2 \mid w_1, t_1, t_2)\,P(w_3 \mid w_1, w_2, t_1, t_2, t_3)\cdots = \prod_{i=1}^{N} P(w_i \mid t_i) \quad \text{(Strong hypothesis 1)}$$

$$P(T) = P(t_1, t_2, \ldots, t_N) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2, t_1)\cdots = \prod_{i=1}^{N} P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_{i-n}) \quad \text{(Strong hypothesis 2)}$$

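3. Contextual analysis: n-grams

Example: « Dogs like to bark », using bi-grams (n=2)

… (all possible paths) …

P(N_pl, verb, Inf_marker, verb | dogs, like, to, bark)

∝ P(dogs|N_pl) · P(like|verb) · P(to|Inf_marker) · P(bark|verb)

× P(N_pl|_) · P(verb|N_pl) · P(Inf_marker|verb) · P(verb|Inf_marker)

The first line of factors is $\prod_{i=1}^{N} P(w_i \mid t_i)$ and the second is $\prod_{i=1}^{N} P(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_{i-n})$; the constant factor 1/P(W) is dropped because it does not change the best path.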
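The bi-gram scoring of « Dogs like to bark » can be sketched by enumerating all possible paths and keeping the best one. Every probability below is made up for illustration, the tag set is reduced to two candidates per word, and the names `EMIT`, `TRANS`, `score` and `best_tags` are hypothetical. A practical tagger would use Viterbi search rather than exhaustive enumeration.

```python
from itertools import product

# Made-up probabilities, for illustration only.
EMIT = {  # P(word | tag)
    ("dogs", "N_pl"): 0.01, ("dogs", "V_3sg"): 0.001,
    ("like", "Verb"): 0.02, ("like", "Prep"): 0.03,
    ("to", "Inf_marker"): 0.9, ("to", "Prep"): 0.2,
    ("bark", "Verb"): 0.005, ("bark", "N_sg"): 0.002,
}
TRANS = {  # P(tag | previous tag); "_" marks the sentence start
    ("_", "N_pl"): 0.2, ("_", "V_3sg"): 0.01,
    ("N_pl", "Verb"): 0.3, ("N_pl", "Prep"): 0.2,
    ("V_3sg", "Verb"): 0.01, ("V_3sg", "Prep"): 0.3,
    ("Verb", "Inf_marker"): 0.2, ("Verb", "Prep"): 0.2,
    ("Prep", "Inf_marker"): 0.01, ("Prep", "Prep"): 0.01,
    ("Inf_marker", "Verb"): 0.9, ("Inf_marker", "N_sg"): 0.01,
    ("Prep", "Verb"): 0.01, ("Prep", "N_sg"): 0.3,
}

def score(words, tags):
    """Product of P(w_i | t_i) * P(t_i | t_{i-1}) along one path."""
    p, prev = 1.0, "_"
    for w, t in zip(words, tags):
        p *= EMIT.get((w, t), 0.0) * TRANS.get((prev, t), 0.0)
        prev = t
    return p

def best_tags(words):
    """Enumerate every tag sequence and return the highest-scoring one."""
    options = [[t for (w2, t) in EMIT if w2 == w] for w in words]
    return max(product(*options), key=lambda tags: score(words, tags))

print(best_tags(["dogs", "like", "to", "bark"]))
```

With these toy numbers the winning path is (N_pl, Verb, Inf_marker, Verb), the reading given on the slide.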


4. Syntactic-Prosodic Phrasing

• Chinks 'n chunks

  a prosodic phrase = a sequence of chinks (≈ function words) followed by a sequence of chunks (≈ content words)

• Example:
  I asked them
  if they were going home
  to Idaho
  and they said yes
  and anticipated one more stop
  before getting home
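A minimal sketch of the chinks 'n chunks rule: a new prosodic phrase starts whenever a chink follows a chunk. The chink list below is a hypothetical one, just large enough for the example sentence; note that it treats "them" as a chunk, which is an assumption needed to reproduce the phrasing shown above.

```python
# Hypothetical chink list covering only the example sentence.
CHINKS = {"i", "if", "they", "were", "to", "and", "before"}

def chink_chunk_phrases(words, chinks=CHINKS):
    """Split words into phrases of the form chink* chunk*."""
    phrases, current, seen_chunk = [], [], False
    for w in words:
        is_chink = w.lower() in chinks
        if is_chink and seen_chunk:
            # A chink after a chunk opens a new phrase.
            phrases.append(current)
            current, seen_chunk = [], False
        current.append(w)
        if not is_chink:
            seen_chunk = True
    if current:
        phrases.append(current)
    return phrases

sentence = ("I asked them if they were going home to Idaho and they said yes "
            "and anticipated one more stop before getting home")
for phrase in chink_chunk_phrases(sentence.split()):
    print(" ".join(phrase))
```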

4. Syntactic-Prosodic Phrasing

• CART tree

[Figure: a CART decision tree with Yes/No branches. Internal nodes are labeled with the feature tested and a count ratio (st 2974/3379, tr 249/423, j2 1617/1838, j3 1866/2261, fl 491/613, ...); leaves carry count ratios only (297/298, 1108/1118, ...).]

st = time to end of sentence
j3 = tag of word on the right
j2 = tag of word on the left
tr = utterance rate (in words/second)
…

Classification and Regression Trees (CARTs)

[Figure: a sequence of 14 items, numbered 1 to 14, differing in shape, size and color.]

Predict Color(n) ⇐ SHapes, Sizes (n, n-1, n+1, n-2, n+2, …)?

SOL: if SHape(n-1) = SHape(n) ⇒ Color(n) = White, else Color(n) = Black

This can be seen as a classification problem:

C(n)  S(n)  SH(n)  S(n-1)  SH(n-1)  S(n+1)  SH(n+1)  …
B     S     S      -       -        S       C        …
B     S     C      S       S        B       C        …
W     B     C      S       C        S       C        …
…     …     …      …       …        …       …        …
W     S     C      B       C        B       T        …


[Figure: a question Q (Yes/No) splits an « impure » set into two « pure » sets. The question to choose is the one which splits into the best « purified » sets.]

« Purity » of a set?

= « Entropy » (bits)

= -P(Black) log2[P(Black)] - P(White) log2[P(White)]

Before the split (« impure » set): Entropy = -(1/2 × -1) - (1/2 × -1) = 1 bit

After the split, each « pure » set: Entropy = -(1 × 0) - (0 × …) = 0 bit

Total entropy after split = 0 + 0 = 0
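The entropy computation above can be checked with a few lines of code. This is a sketch under one stated assumption: the entropy after a split is taken as the size-weighted average of the two subsets, a common convention, whereas the slide simply adds the two values (both give 0 for pure subsets). The function names are hypothetical.

```python
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * log2(p) for p in probs)

def split_entropy(yes, no):
    """Size-weighted entropy of the two subsets produced by a Yes/No question."""
    n = len(yes) + len(no)
    return len(yes) / n * entropy(yes) + len(no) / n * entropy(no)

impure = ["Black", "White", "Black", "White"]
print(entropy(impure))                                        # 1.0
print(split_entropy(["Black", "Black"], ["White", "White"]))  # 0.0
```

A CART builder would evaluate every candidate question this way and keep the one with the lowest entropy after the split.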

5. Automatic Phonetization
• Rule-based, usually formalized as in generative phonology:
  a → [b] / l _ r

• Example: s → [s] or [z]
  s → [z] / [éanti|hanti] _ [<V>]
  s → [s] / [anti|contre|impré|prime|tourne|ultra|psycho|télé] _ []
  s → [s] / [vrai] _ [em]
  s → [s] / [_a|para|sinu] _ [e|o|y]
  s → [z] / [tran] _ [a|h|i]
  s → [z] / [<V>] _ [<V>]
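A rule of the form a → [b] / l _ r rewrites the letter a as the phoneme [b] when it is preceded by the left context l and followed by the right context r. The sketch below applies three of the rules for s in order, the first match winning. The regular-expression encoding, the vowel class and the default of [s] when no rule fires are assumptions made for this illustration.

```python
import re

VOWEL = "[aeiouyéèêàâîôû]"  # rough stand-in for <V> (assumption)

# (left context, right context, phoneme), tried in order.
RULES = [
    (r"(?:^a|para|sinu)$", r"^[eoy]", "s"),   # s -> [s] / [_a|para|sinu] _ [e|o|y]
    (r"tran$", r"^[ahi]", "z"),               # s -> [z] / [tran] _ [a|h|i]
    (VOWEL + "$", "^" + VOWEL, "z"),          # s -> [z] / [<V>] _ [<V>]
]

def phonetize_s(word, i):
    """Return the phoneme for the letter 's' at index i of word."""
    assert word[i] == "s"
    left, right = word[:i], word[i + 1:]
    for left_ctx, right_ctx, phoneme in RULES:
        if re.search(left_ctx, left) and re.search(right_ctx, right):
            return phoneme
    return "s"  # default when no rule applies (assumption)

for word, i in [("rose", 2), ("parasol", 4), ("transit", 4), ("sel", 0)]:
    print(word, phonetize_s(word, i))
```

Rule order matters: "parasol" also satisfies the general vowel-s-vowel rule, but the more specific [para] rule is tried first and yields [s].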

5. Automatic Phonetization
• Decision trees for predicting phonemes + stress

[Figure: a decision tree (= TRIE) whose successive levels test the focus character, the first character on the right, the first character on the left, the second character on the right and the second character on the left (+ part-of-speech!); the leaf reached gives the phoneme, e.g. [eI].]

6. Prosody generation: patterns

- Si ces oeufs étaient frais j'en prendrais. Qui les vend ? C'est bien toi, ma jolie ?
- Evidemment, Monsieur.
- Allons donc ! Prouve-le-moi.

("If these eggs were fresh, I would take some. Who sells them? Is it really you, my pretty one?" "Of course, sir." "Come now! Prove it to me.")

Patterns labeled in the figure: continuation minor, wh-question, question, continuation major, finality, echo, implication, parenthesis, exclamation, order.


6.Prosody generation : tones

L L H L L H L L H L L L -

NB : more theories than researchers …

6. Prosody generation: tones
• From text to tones

ce personnage grossier, te dérange-t-il
WS    . . . o . o . . o .
SG    (. . . -) (. -) (. . . -)
IG 1  (. . . /LL) (. HH) (. . . H/H)
IG 2  (. . . - . HH) (. . . H/H)

WS = word stress = lexical stress ⇐ Phonetization
SG = stress group (only one stressed syllable)
IG = intonation group ⇐ Synt.-Pros. Phrasing

6. Prosody generation: tones
• From tones to F0 (Hz) by rule:

[Figure: the tone levels H+, H, L and L- positioned in the F0 range; labels shown are ceiling, low, range, floor and slope. Example tone sequence /L L H H H L- l- aligned with the phonetic string s ´ p E { s o n a Z g { o s j e n u d e { A) Z f { A) S m A).]

6. Prosody generation: tones
• From tones to F0 (Hz), corpus-based

• Large speech database, with known intonation groups, F0 and tones

• For each target intonation group:
  – find a list of similar intonation groups (in terms of tones, number of syllables, position in sentence, etc.) in the database

• Select the sequence of intonation groups in the database which:
  – best represents the target groups
  – AND minimizes intonative discontinuities


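6. Prosody generation: tones

(H, L, H+, L- = stressed syllables; h, l = unstressed syllables)

[Figure: the TARGET tone sequence (l l l H) (l l H) (l l L-) is matched by VITERBI search against intonation groups from the very large corpus; the resulting F0 (Hz) contours are shown for a bad and a better selection.]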

Conclusion

• Introduction
• Acoustic speech synthesis (DSP)
  – Model-based (rule-based) approach
  – Instance-based (concatenative) approach
• Towards...

Towards corpus-based techniques

  – For automatic phonetization (L&H, ENST, Univ. Edinburgh, FPMs)
  – For automatic generation of intonation and phoneme duration (AT&T, FPMs, Univ. Aix, Univ. Edinburgh)
  – For automatic selection of units for concatenative synthesis (ATR, Univ. Edinburgh, AT&T, FPMs?)

1995-?: The database years
