
Discrete/Continuous Modelling of Speaking Style in HMM-based Speech Synthesis:

Design and Evaluation

Nicolas Obin 1,2, Pierre Lanchantin 1

Anne Lacheret 2, Xavier Rodet 1

1 Analysis-Synthesis Team, IRCAM, Paris, France
2 Modyco Lab., University of Paris Ouest - La Défense, Nanterre, France

[email protected], [email protected], [email protected], [email protected]

Abstract

This paper assesses the ability of an HMM-based speech synthesis system to model the speech characteristics of various speaking styles¹. A discrete/continuous HMM is presented to model the symbolic and acoustic speech characteristics of a speaking style. The proposed model is used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consists of an identification experiment of 4 speaking styles based on delexicalized speech, compared to a similar experiment on natural speech. The comparison is discussed and reveals that the discrete/continuous HMM consistently models the speech characteristics of a speaking style.

Index Terms: speaking style, speech synthesis, speech prosody, average modelling.

1. Introduction

Each speaker has his own speaking style, which constitutes his vocal signature and a part of his identity. Nevertheless, a speaker continuously adapts his speaking style according to specific communication situations and to his emotional state. In particular, each situational context determines a specific mode of production associated with it - a genre - which is defined by a set of conventions of form and content that are shared among all of its productions [1]. In particular, a specific discourse genre (DG) relates to a specific speaking style. Consequently, a speaker adapts his speaking style to each specific situation depending on the formal conventions that are associated with the situation, his a priori knowledge about these conventions, and his competence to adapt his speaking style. Thus, each communication act instantiates a style which is composed of a style that depends on the speaker identity and a conventional speaking style that is conditioned by a specific situation.

In speech synthesis, methods have been proposed to model and adapt the symbolic [2, 3] and acoustic speech characteristics of a speaking style, with application to emotional speech synthesis [4]. However, no study exists on the joint modelling of the symbolic and acoustic characteristics of speaking style, and speaking-style acoustic modelling is generally limited to the modelling of emotion, with rare extensions to other sources of speaking-style variation [5].

¹ This study was partially funded by “La Fondation Des Treilles”, and supported by ANR Rhapsodie 07 Corp-030-01 (reference prosody corpus of spoken French; French National Agency of Research; 2008-2012).

This paper presents an average discrete/continuous HMM which is applied to the speaking-style modelling of various discourse genres in speech synthesis, and assesses whether the model adequately captures the speech prosody characteristics of a speaking style. Incidentally, the robustness of HMM-based speech synthesis is evaluated under the conditions of real-world applications. The paper is organized as follows: the speaking-style corpus design is described in section 2; the average discrete/continuous HMM model is presented in section 3; the evaluation is presented and discussed in sections 4 and 5.

2. Speech & Text Material

2.1. Corpus Design

For the purpose of speaking-style speech synthesis, a 4-hour multi-speaker speech database was designed. The speech database consists of four different DG's: Catholic mass ceremony, political, journalistic, and sport commentary. In order to reduce the DG intra-variability, the different DGs were restricted to specific situational contexts (see list below) and to male speakers only.

Figure 1: Prosodic description of the speaking styles depending on the speaker. Mean and variance of f0 and speech rate (syllables per second). (Scatter plot: log(1/speech rate) on the x-axis, log f0 on the y-axis; one point per speaker M1-M7, P1-P5, J1-J5, S1-S6.)



The following is a description of the four selected DG’s:

mass: Christian church sermon (pilgrimage and Sunday high-mass sermons); single-speaker monologue; no interaction.

political: New Year's speech of the French president; single-speaker monologue; no interaction.

journal: radio review (press review; political, economic, and technological chronicles); almost single-speaker monologue with a few interactions with a lead journalist.

sport commentary: soccer; two speakers engaged in monologues with speech overlapping during intense soccer sequences and speech-turn changes; almost no interaction.

The speech database consists of natural speech multi-media audio content with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech database are illustrated in figure 1.

3. Speaking Style Model

A speaking style model λ(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style.

λ(style) = (λ(style)_symbolic, λ(style)_acoustic)    (1)

During training, the discrete/continuous context-dependent HMMs are estimated separately. During synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [6] and used to refine the context-dependent HMM modelling (see [7] and [8] for a detailed description of the enriched linguistic contexts).

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discrete HMM λ(style)_symbolic is estimated from the pooled speakers associated with the speaking style.

The prosodic grammar consists of a hierarchical prosodic representation that was investigated as an alternative to ToBI [9] for French prosody labelling [10]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average model λ(style)_symbolic is to be estimated. Let l = (l(1), ..., l(R)) be the total set of prosodic symbolic observations, and l(r) = [l(r)(1), ..., l(r)(N_r)] the prosodic symbolic sequence associated with speaker r, where l(r)(n) is the prosodic label associated with the n-th syllable. Let q = (q(1), ..., q(R)) be the total set of linguistic context observations, and q(r) = [q(r)(1), ..., q(r)(N_r)] the linguistic context sequence associated with speaker r, where q(r)(n) = [q_1(r)(n), ..., q_L(r)(n)]ᵀ is the (L×1) linguistic context vector which describes the linguistic characteristics associated with the n-th syllable.

An average context-dependent discrete HMM λ(style)_symbolic is estimated from the pooled speaker observations. First, an average context-dependent tree T(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels l conditionally to the linguistic contexts q. Then, a context-dependent HMM λ(style)_symbolic is estimated for each terminal node of the context-dependent tree T(style)_symbolic.
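The entropy-minimizing tree derivation can be illustrated with a toy sketch. This is not the authors' implementation: the context feature and question names are hypothetical, and only a single split is searched, whereas full tree induction applies such splits recursively.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of prosodic labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(contexts, labels, questions):
    """Return (gain, name) for the binary context question that most reduces
    the conditional entropy of the labels (one split only, for illustration)."""
    n, base, best = len(labels), entropy(labels), None
    for name, pred in questions:
        yes = [l for c, l in zip(contexts, labels) if pred(c)]
        no = [l for c, l in zip(contexts, labels) if not pred(c)]
        if not yes or not no:
            continue  # degenerate split: every item on one side
        cond = (len(yes) * entropy(yes) + len(no) * entropy(no)) / n
        if best is None or base - cond > best[0]:
            best = (base - cond, name)
    return best

# Hypothetical syllable contexts: in this toy data, phrase-final position
# perfectly predicts a major boundary (FM).
ctx = [{'final': True}, {'final': False}, {'final': True}, {'final': False}]
lab = ['FM', '-', 'FM', '-']
qs = [('is-phrase-final?', lambda c: c['final'])]
gain, q_name = best_split(ctx, lab, qs)
# gain equals the full base entropy (1 bit): the split is perfect
```

In the real model, splitting continues recursively on each child node until a stopping criterion is met, and a discrete HMM is then estimated at each terminal node.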

3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ(style)_acoustic that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [11].

Let R be the number of speakers from which an average model is to be estimated. Let o = (o(1), ..., o(R)) be the total set of observations, and o(r) = [o(r)(1), ..., o(r)(T_r)] the observation sequence associated with speaker r, where o(r)(t) = [o_t(r)(1), ..., o_t(r)(D)]ᵀ is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q(1), ..., q(R)) be the total set of linguistic context observations, and q(r) = [q(r)(1), ..., q(r)(T_r)] the linguistic context sequence associated with speaker r, where q(r)(t) = [q_1(r)(t), ..., q_L(r)(t)]ᵀ is the (L×1) linguistic context vector which describes the linguistic properties at time t.

An average context-dependent acoustic HMM λ(style)_acoustic is estimated from the pooled speaker observations. First, a context-dependent HMM is estimated for each of the linguistic contexts. Then, an average context-dependent tree T(style)_acoustic is derived so as to minimize the description length of the context-dependent HMM λ(style)_acoustic.

The acoustic module simultaneously models source/filter variations, f0 variations, and the temporal structure associated with a speaking style. Speaker f0 values were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic counterparts are used to estimate the context-dependent HMMs λ(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [11]). Multi-Space probability Distributions (MSD) [12] are used to model the continuous/discrete f0 parameter sequence so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state-duration probability density functions (PDFs) to account for the temporal structure of speech [13]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [14].
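As a rough illustration of how an MSD accommodates the mixed discrete/continuous nature of f0, the sketch below scores a single frame against a two-space distribution: a zero-dimensional "unvoiced" space and a one-dimensional Gaussian over log-f0 for the voiced space. The space weight and Gaussian parameters are made-up numbers, not values from the paper.

```python
import math

def msd_loglik(f0, w_voiced, mean, var):
    """Log-likelihood of one frame under a two-space MSD:
    unvoiced frames (f0 is None) fall in the discrete space with
    weight 1 - w_voiced; voiced frames are scored by a Gaussian
    over log-f0 with weight w_voiced."""
    if f0 is None:
        return math.log(1.0 - w_voiced)
    z = (math.log(f0) - mean) ** 2 / var
    return math.log(w_voiced) - 0.5 * (math.log(2 * math.pi * var) + z)

# Hypothetical state parameters: 80% voiced, log-f0 centred near 4.8 (~120 Hz)
ll_unvoiced = msd_loglik(None, w_voiced=0.8, mean=4.8, var=0.04)
ll_voiced = msd_loglik(120.0, w_voiced=0.8, mean=4.8, var=0.04)
```

Because both voiced and unvoiced frames receive a proper likelihood under the same distribution, no artificial f0 interpolation through unvoiced regions is needed during training.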

3.2. Generation of the Speech Parameters

During synthesis, the text is first converted into a concatenated sequence of context-dependent HMMs λ(style)_symbolic associated with the linguistic context sequence q = [q_1, ..., q_N], where q_n = [q_1, ..., q_L]ᵀ denotes the (L×1) linguistic context vector associated with the n-th phoneme.

First, the prosodic symbolic sequence l̂ is determined so as to maximize the likelihood of the prosodic symbolic sequence l conditionally to the linguistic context sequence q and the model λ(style)_symbolic:

l̂ = argmax_l p(l | q, λ(style)_symbolic)    (2)
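Under the additional simplifying assumption that prosodic labels are conditionally independent given their contexts, the argmax in (2) reduces to a per-syllable lookup of the most probable label at the leaf reached by each context. The sketch below uses a plain dict as a stand-in for the clustered symbolic model; the back-off to a "no event" label for unseen contexts is a hypothetical choice, not the paper's.

```python
def generate_labels(context_seq, model, no_event='-'):
    """Greedy symbolic generation: for each syllable context q_n, emit the
    label maximizing p(l | q_n), assuming labels are conditionally
    independent given their contexts.
    `model` maps a hashable context key to a {label: probability} dict."""
    out = []
    for q in context_seq:
        dist = model.get(q, {no_event: 1.0})  # back off for unseen contexts
        out.append(max(dist, key=dist.get))
    return out

# Toy clustered model over two context classes
model = {
    'phrase-final': {'FM': 0.7, 'Fm': 0.2, '-': 0.1},
    'phrase-medial': {'-': 0.6, 'P': 0.4},
}
labels = generate_labels(['phrase-medial', 'phrase-final', 'unseen'], model)
# labels == ['-', 'FM', '-']
```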

Then, the linguistic context sequence q augmented with the prosodic symbolic sequence l̂ is converted into a concatenated sequence of context-dependent models λ(style)_acoustic.

The acoustic sequence ô is inferred so as to maximize the likelihood of the acoustic sequence o conditionally to the model λ(style)_acoustic:

ô = argmax_o max_q p(o | q, λ(style)_acoustic) p(q | λ(style)_acoustic)    (3)

First, the state sequence q̂ is determined so as to maximize the likelihood of the state sequence conditionally to the model λ(style)_acoustic. Then, the observation sequence ĉ is determined so as to maximize the likelihood of the observation sequence conditionally to the state sequence q̂ and the model λ(style)_acoustic, under the dynamic constraint o = Wc:

R_q̂ ĉ = r_q̂    (4)

where:

R_q̂ = Wᵀ Σ_q̂⁻¹ W    (5)

r_q̂ = Wᵀ Σ_q̂⁻¹ µ_q̂    (6)

and Σ_q̂ and µ_q̂ are respectively the covariance matrix and the mean vector for the state sequence q̂.
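Equations (4)-(6) can be solved directly once W is built from the delta-window coefficients. The sketch below is a minimal one-stream, diagonal-covariance version, a common textbook simplification rather than the HTS implementation; the centred-difference delta window and the test values are assumptions.

```python
import numpy as np

def mlpg(mu, var):
    """Solve R c = r (eq. 4) with R = W' S^-1 W (eq. 5) and r = W' S^-1 mu
    (eq. 6) for a single stream with static + delta features.
    mu, var: length-2T arrays interleaved as [static(t), delta(t)] per frame.
    Returns the smooth static trajectory c of length T."""
    T = len(mu) // 2
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row: o_static(t) = c(t)
        for k, w in ((-1, -0.5), (1, 0.5)):    # centred-difference delta row
            if 0 <= t + k < T:
                W[2 * t + 1, t + k] = w
    P = np.diag(1.0 / np.asarray(var, float))  # diagonal Sigma^-1
    R = W.T @ P @ W                            # eq. (5)
    r = W.T @ P @ mu                           # eq. (6)
    return np.linalg.solve(R, r)               # eq. (4): R c = r

# With near-uninformative delta variances, c recovers the static means
mu = np.array([0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 3.0, 0.0])   # [static, delta]*4
var = np.array([1.0, 1e8, 1.0, 1e8, 1.0, 1e8, 1.0, 1e8])
c = mlpg(mu, var)
# c ≈ [0, 1, 2, 3]
```

Tightening the delta variances instead would pull the trajectory toward the delta-implied slopes, which is precisely how the dynamic constraint smooths frame-to-frame variation in generated speech.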

CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier is used to merge the sequence of phonemes into a sequence of syllables. At the prosodic level, pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model infers a sequence of prosodic labels that are associated with relevant prosodic events.

sentence: Longtemps , je me suis couché de bonne heure .
⇓ prosodic structure:
FM: * *    Fm: * * *    P: * * * *
syllable: Long- temps ## je me suis cou- ché de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the one hand, expert approaches attempt to elaborate formal models that account for the observed prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic constraints. On the other hand, statistical methods attempt to elaborate a statistical model which accounts for the prosodic variations from the observation of statistical regularities on large speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure modelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Delais-Roussarie, 2000, Mertens, 2004b] for French; [Barbosa, 2006] for some other languages). Expert models assume that a prosodic structure results from the integration of various and potentially conflicting constraints, in particular syntactic and rhythmic [?, Dell, 1984] constraints. The linguistic module mostly concerns the extraction of prominent syntactic boundaries from deep syntactic parsing, based on syntactic constituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART


• journal: radio review (press review; political, economic, technological chronicles); almost a single-speaker monologue, with a few interactions with a lead journalist.

• sport commentary: soccer; two speakers engaged in monologues, with speech overlapping during intense soccer sequences and speech turn changes; almost no interactions.

The speech database consists of natural multi-media speech contents with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech database are illustrated in figure 1.

3. SPEAKING STYLE MODEL

A speaking style model λ^(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style.

λ^(style) = ( λ^(style)_symbolic , λ^(style)_acoustic )    (1)

During the training, the discrete/continuous context-dependent HMMs are estimated separately. During the synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [9] and used to refine the context-dependent HMM modelling (see [10] and [11] for a detailed description of the enriched linguistic contexts).
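The cascade at synthesis time can be sketched as two chained stages. The function names and toy models below are assumptions for illustration, not the authors' code; a real system would plug in the trained discrete and continuous HMMs:

```python
# Two-stage cascade: text contexts -> symbolic prosodic labels -> acoustics.

def symbolic_stage(linguistic_contexts, symbolic_model):
    """Infer one prosodic label per linguistic unit (stubbed model)."""
    return [symbolic_model(ctx) for ctx in linguistic_contexts]

def acoustic_stage(linguistic_contexts, prosodic_labels, acoustic_model):
    """Generate acoustic parameters conditioned on text AND inferred prosody."""
    return [acoustic_model(ctx, lab)
            for ctx, lab in zip(linguistic_contexts, prosodic_labels)]

def synthesize(contexts, symbolic_model, acoustic_model):
    labels = symbolic_stage(contexts, symbolic_model)
    return acoustic_stage(contexts, labels, acoustic_model)

# Toy models: pick a label from one context feature, then map it to an
# "f0 target" scalar standing in for the full acoustic parameter vector.
toy_symbolic = lambda ctx: "FM" if ctx["pause_follows"] else "P"
toy_acoustic = lambda ctx, lab: 80.0 if lab == "FM" else 110.0

contexts = [{"pause_follows": False}, {"pause_follows": True}]
print(synthesize(contexts, toy_symbolic, toy_acoustic))  # [110.0, 80.0]
```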

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speakers associated with the speaking style.

The prosodic grammar consists of a hierarchical prosodic representation that was proposed as an alternative to TOBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average model λ^(style)_symbolic is to be estimated. Let q_proso = (q^(1)_proso, …, q^(R)_proso) be the total set of prosodic symbolic observations, where q^(r)_proso = [q^(r)_proso(1), …, q^(r)_proso(N_r)] is the prosodic symbolic sequence associated with speaker r, and q^(r)_proso(n) is the prosodic label associated with syllable n. Let q = (q^(1), …, q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), …, q^(r)(N_r)] is the linguistic context sequence associated with speaker r, and q^(r)(n) = [q^(r)_1(n), …, q^(r)_L(n)]ᵀ is the (L×1) linguistic context vector which describes the linguistic characteristics associated with syllable n.

An average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T^(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels q_proso conditionally on the linguistic contexts q. Then, a context-dependent HMM model λ^(style)_symbolic is estimated for each terminal node of the context-dependent tree T^(style)_symbolic.
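The entropy criterion used to grow the context-dependent tree can be illustrated with a toy node split. The context feature and the candidate questions are invented for the example; a real tree would iterate this greedy split recursively over the enriched linguistic contexts:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(data, question):
    """Weighted entropy of prosodic labels after a yes/no context split."""
    yes = [lab for ctx, lab in data if question(ctx)]
    no = [lab for ctx, lab in data if not question(ctx)]
    n = len(data)
    return sum(len(part) / n * entropy(part) for part in (yes, no) if part)

def best_question(data, questions):
    """Greedy node split: pick the question minimizing conditional entropy."""
    return min(questions, key=lambda q: split_entropy(data, q))

# Toy pooled data: (linguistic context, prosodic label) per syllable.
data = [({"pause_follows": True}, "FM"), ({"pause_follows": True}, "FM"),
        ({"pause_follows": False}, "P"), ({"pause_follows": False}, "Fm")]
questions = {"pause?": lambda c: c["pause_follows"],
             "always": lambda c: True}
q = best_question(data, list(questions.values()))
# The "pause?" question wins: it leaves 0.5 bits vs 1.5 bits for no split.
```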

3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ^(style)_acoustic that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].

Let R be the number of speakers from which an average model is to be estimated. Let o = (o^(1), …, o^(R)) be the total set of observations, where o^(r) = [o^(r)(1), …, o^(r)(T_r)] is the observation sequence associated with speaker r, and o^(r)(t) = [o^(r)_1(t), …, o^(r)_D(t)]ᵀ is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q^(1), …, q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), …, q^(r)(T_r)] is the linguistic context sequence associated with speaker r, and q^(r)(t) = [q^(r)_1(t), …, q^(r)_{L'}(t)]ᵀ is the (L'×1) augmented linguistic context vector which describes the linguistic properties at time t.

An average context-dependent HMM acoustic model λ^(style)_acoustic is estimated from the pooled speaker observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree T^(style)_acoustic is derived so as to minimize the description length of the context-dependent HMM model λ^(style)_acoustic.

The acoustic module simultaneously models the source/filter variations, the f0 variations, and the temporal structure associated with a speaking style. Speaker f0 values were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate the context-dependent HMM models λ^(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state duration probability density functions (PDFs) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
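The per-style f0 normalization and the voiced/unvoiced handling can be sketched as follows. A log-domain z-score is one common normalization scheme and is assumed here, not necessarily the authors' exact one; unvoiced frames are kept as a separate discrete value, mirroring how an MSD pairs a discrete "unvoiced" space with the continuous f0 space:

```python
import math

def normalize_f0(f0_hz, style_mean_log, style_std_log):
    """Per-style f0 normalization in the log domain (assumed scheme).

    Voiced frames become z-scores of log-f0 relative to the speaking-style
    statistics; unvoiced frames (None or non-positive f0) stay as None,
    as in an MSD-style discrete/continuous representation."""
    out = []
    for f in f0_hz:
        if f is None or f <= 0:  # unvoiced frame
            out.append(None)
        else:
            out.append((math.log(f) - style_mean_log) / style_std_log)
    return out

frames = [100.0, None, 120.0, 0.0]
norm = normalize_f0(frames, style_mean_log=math.log(110.0), style_std_log=0.2)
# voiced frames are now z-scores; unvoiced frames remain None
```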

3.2. Generation of the Speech Parameters

During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models λ^(style)_symbolic associated with the linguistic context sequence q = [q_1, …, q_N], where q_n = [q_1, …, q_L]ᵀ denotes the (L×1) linguistic context vector associated with linguistic unit n.

Firstly, the prosodic symbolic sequence q_proso is inferred so as to maximize the log-likelihood of the prosodic symbolic sequence q_proso conditionally on the linguistic context sequence q and the model λ^(style)_symbolic:

q_proso = argmax_{q_proso} P(q_proso | q, λ^(style)_symbolic)    (2)
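For labels that are conditionally independent given their contexts, equation (2) reduces to a per-syllable argmax over the label distribution at the decision-tree leaf reached by each context vector. The tree, leaf names, and probabilities below are toy values for illustration:

```python
def infer_prosody(contexts, leaf_of, leaf_pmf):
    """Greedy instance of eq. (2): choose the most probable prosodic label
    at the tree leaf selected by each linguistic context vector."""
    labels = []
    for ctx in contexts:
        pmf = leaf_pmf[leaf_of(ctx)]          # P(label | leaf)
        labels.append(max(pmf, key=pmf.get))  # argmax over labels
    return labels

# Toy tree: a single question routes each context to one of two leaves.
leaf_of = lambda ctx: "L1" if ctx["pause_follows"] else "L2"
leaf_pmf = {"L1": {"FM": 0.8, "Fm": 0.15, "P": 0.05},
            "L2": {"FM": 0.1, "Fm": 0.3, "P": 0.6}}

ctxs = [{"pause_follows": False}, {"pause_follows": True}]
print(infer_prosody(ctxs, leaf_of, leaf_pmf))  # ['P', 'FM']
```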



[Figure: speech waveform and f0 contour (40–120 Hz) of the utterance "Longtemps, je me suis couché de bonne heure", with aligned phonetic segmentation (two panels).]


• journal: radio review (press review; political, economical, tech-nological chronicles); almost single speaker monologue with afew interactions with a lead journalist.

• sport commentary: soccer; two speakers engaged in mono-logues with speech overlapping during intense soccer sequencesand speech turn changes; almost no interactions.

The speech database consists of natural speech multi-media audiocontents with strongly variable audio quality (background noise:crowd, audience, recording noise, and reverberation). The speechprosody characteristics of the speech databased are illustrated in fig-ure 1.

3. SPEAKING STYLE MODEL

A speaking style model λ(style) is composed of discrete/continuouscontext-dependent HMMs that model the symbolic/acoustic speechcharacteristics of a speaking style.

λ(style) =“λ

(style)symbolic, λ

(style)acoustic

”(1)

During the training, the discrete/continuous context-dependentHMMs are estimated separately. During the synthesis, the sym-bolic/acoustic parameters are generated in cascade, from the sym-bolic representation to the acoustic variations. Additionally, a richlinguistic description of the text characteristics is automatically ex-tracted using a linguistic processing chain [9] and used to refine thecontext-dependent HMM modelling (see [10] and [11] for a detaileddescription of the enriched linguistic contexts).

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discreteHMM λ

(style)symbolic is estimated from the pooled speakers associated

with the speaking style.

The prosodic grammar consists of a hierarchical prosodic repre-sentation that was experimented as an alternative to TOBI [12]for French prosody labelling [13]. The prosodic grammar iscomposed of major prosodic boundaries (FM, a boundary whichis right bounded by a pause), minor prosodic boundaries (Fm, anintermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average modelλ

(style)symbolic is to be estimated. Let qproso = (q

(1)proso, . . . ,q

(R)proso)

the total set of prosodic symbolic observations, andq

(r)proso = [q

(r)proso(1), . . . , q

(r)proso(Nr)] is the prosodic sym-

bolic sequence associated with speaker r, where q(r)proso(n)

is the prosodic label associated with syllable n. Letq = (q(1), . . . ,q(R)) the total set of linguistic contextsobservations, and q(r) = [q(r)(1), . . . ,q(r)(Nr)] is the lin-guistic context sequence associated with speaker r, whereq(r)(n) = [q

(r)1 (n), . . . , q

(r)L (n)]� is the (Lx1) linguistic context

vector which describes the linguistic characteristics associated withsyllable n.

An average context-dependent discrete HMM λ(style)symbolic is estimated

from the pooled speakers observations. Firstly, an average context-dependent tree T(style)


• journal: radio review (press review; political, economic, technological chronicles); almost single-speaker monologue with a few interactions with a lead journalist.

• sport commentary: soccer; two speakers engaged in monologues, with speech overlapping during intense soccer sequences and speech-turn changes; almost no interactions.

The speech database consists of natural multi-media speech contents with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech database are illustrated in figure 1.

3. SPEAKING STYLE MODEL

A speaking style model λ^(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style:

λ^(style) = ( λ^(style)_symbolic , λ^(style)_acoustic )    (1)

During the training, the discrete/continuous context-dependent HMMs are estimated separately. During the synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [9] and used to refine the context-dependent HMM modelling (see [10] and [11] for a detailed description of the enriched linguistic contexts).
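The two-stage cascade described above can be sketched as follows. This is an illustrative sketch only: the model interfaces and the toy symbolic/acoustic stand-ins are assumptions, not the paper's implementation.

```python
# Hedged sketch of the cascade lambda(style) = (lambda_symbolic, lambda_acoustic):
# the two sub-models are trained separately and chained at synthesis time.
# All names and stand-in functions are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SpeakingStyleModel:
    # linguistic contexts -> prosodic labels (discrete, symbolic stage)
    symbolic: Callable[[List[dict]], List[str]]
    # contexts + labels -> acoustic parameters (continuous, acoustic stage)
    acoustic: Callable[[List[dict], List[str]], List[float]]

    def synthesize(self, contexts: List[dict]) -> List[float]:
        labels = self.symbolic(contexts)        # stage 1: symbolic representation
        return self.acoustic(contexts, labels)  # stage 2: acoustic variations

# Toy stand-ins: a pause-final syllable gets a major boundary (FM) and a
# lengthened duration factor.
style = SpeakingStyleModel(
    symbolic=lambda ctxs: ["FM" if c.get("pre_pause") else "-" for c in ctxs],
    acoustic=lambda ctxs, labs: [1.5 if l == "FM" else 1.0 for l in labs],
)
print(style.synthesize([{}, {"pre_pause": True}]))  # → [1.0, 1.5]
```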

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speakers associated with the speaking style.

The prosodic grammar consists of a hierarchical prosodic representation that was experimented with as an alternative to TOBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average model λ^(style)_symbolic is to be estimated. Let q_proso = (q^(1)_proso, ..., q^(R)_proso) be the total set of prosodic symbolic observations, where q^(r)_proso = [q^(r)_proso(1), ..., q^(r)_proso(N_r)] is the prosodic symbolic sequence associated with speaker r, and q^(r)_proso(n) is the prosodic label associated with syllable n. Let q = (q^(1), ..., q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), ..., q^(r)(N_r)] is the linguistic context sequence associated with speaker r, and q^(r)(n) = [q^(r)_1(n), ..., q^(r)_L(n)]ᵀ is the (L×1) linguistic context vector which describes the linguistic characteristics associated with syllable n.

An average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T^(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels q_proso conditionally to the linguistic contexts q. Then, a context-dependent HMM model λ^(style)_symbolic is estimated for each terminal node of the context-dependent tree T^(style)_symbolic.
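As a rough illustration of the entropy criterion used to grow T^(style)_symbolic, the following sketch selects, among candidate context questions, the yes/no split that minimizes the frequency-weighted entropy of the prosodic labels. The label set and the context questions are hypothetical placeholders, not the paper's actual question set.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (bits) of a list of prosodic labels, e.g. 'FM', 'Fm', 'P', '-'."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split(samples, questions):
    """Pick the context question whose yes/no split minimizes the
    frequency-weighted entropy of the prosodic labels.

    samples:   list of (context_dict, prosodic_label) pairs
    questions: list of (name, predicate) pairs; predicate(context) -> bool
    """
    n = len(samples)
    best = None
    for name, pred in questions:
        yes = [lab for ctx, lab in samples if pred(ctx)]
        no = [lab for ctx, lab in samples if not pred(ctx)]
        if not yes or not no:
            continue  # degenerate split: one side empty
        h = (len(yes) / n) * label_entropy(yes) + (len(no) / n) * label_entropy(no)
        if best is None or h < best[0]:
            best = (h, name)
    return best  # (weighted entropy, question name), or None

# Toy data: syllables before a punctuation mark tend to carry a major boundary (FM).
samples = [({"pre_punct": True}, "FM"), ({"pre_punct": True}, "FM"),
           ({"pre_punct": False}, "P"), ({"pre_punct": False}, "-")]
questions = [("is-pre-punctuation", lambda c: c["pre_punct"])]
print(best_split(samples, questions))  # → (0.5, 'is-pre-punctuation')
```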

3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ^(style)_acoustic that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].

Let R be the number of speakers from which an average model is to be estimated. Let o = (o^(1), ..., o^(R)) be the total set of observations, where o^(r) = [o^(r)(1), ..., o^(r)(T_r)] is the observation sequence associated with speaker r, and o^(r)(t) = [o^(r)_t(1), ..., o^(r)_t(D)]ᵀ is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q^(1), ..., q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), ..., q^(r)(T_r)] is the linguistic context sequence associated with speaker r, and q^(r)(t) = [q^(r)_1(t), ..., q^(r)_L'(t)]ᵀ is the (L'×1) augmented linguistic context vector which describes the linguistic properties at time t.

An average context-dependent HMM acoustic model λ^(style)_acoustic is estimated from the pooled speaker observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree T^(style)_acoustic is derived so as to minimize the description length of the context-dependent HMM model λ^(style)_acoustic.

The acoustic module simultaneously models the source/filter variations, f0 variations, and temporal structure associated with a speaking style. Speaker f0 was normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate the context-dependent HMM models λ^(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with a state-duration probability density function (PDF) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
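The MSD idea for f0 can be sketched as a two-space stream: a discrete space for unvoiced frames and a continuous Gaussian space for voiced f0 values. The parameters below are illustrative placeholders, not values from the trained models.

```python
import math

class MSDStream:
    """Minimal two-space MSD sketch for f0: one discrete 'unvoiced' space and
    one 1-D Gaussian 'voiced' space. Parameters are hypothetical."""

    def __init__(self, w_voiced, mean, var):
        self.w_voiced = w_voiced  # prior weight of the voiced (continuous) space
        self.mean = mean          # Gaussian mean of the normalized f0
        self.var = var            # Gaussian variance

    def likelihood(self, obs):
        """obs is None for an unvoiced frame, else a (normalized) f0 value."""
        if obs is None:           # discrete space: only its weight contributes
            return 1.0 - self.w_voiced
        g = math.exp(-0.5 * (obs - self.mean) ** 2 / self.var) \
            / math.sqrt(2 * math.pi * self.var)
        return self.w_voiced * g  # continuous space: weighted Gaussian density

stream = MSDStream(w_voiced=0.8, mean=0.0, var=1.0)
print(stream.likelihood(None))  # unvoiced frame
print(stream.likelihood(0.0))   # voiced frame at the mean
```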

3.2. Generation of the Speech Parameters

During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models λ^(style)_symbolic associated with the linguistic context sequence q = [q_1, ..., q_N], where q_n = [q_1, ..., q_L]ᵀ denotes the (L×1) linguistic context vector associated with linguistic unit n.

Firstly, the prosodic symbolic sequence q_proso is inferred so as to maximize the log-likelihood of the prosodic symbolic sequence conditionally to the linguistic context sequence q and the model λ^(style)_symbolic:

q̂_proso = argmax_{q_proso} P( q_proso | q , λ^(style)_symbolic )    (2)
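A minimal sketch of the inference in Eq. (2), assuming each leaf of the context tree stores a categorical distribution over prosodic labels; the tree routing and the probabilities below are hypothetical placeholders. With independent leaf distributions, the sequence-level argmax decomposes syllable by syllable.

```python
# Hedged sketch of Eq. (2): the trained symbolic model is represented here as a
# decision tree whose leaves store P(prosodic label | context class). The tree,
# contexts, and probabilities are hypothetical, not the paper's trained model.

def leaf_distribution(context):
    """Stand-in for routing a linguistic context vector down the context tree
    to a terminal node and returning that node's label distribution."""
    if context.get("pre_pause"):
        return {"FM": 0.7, "Fm": 0.2, "P": 0.05, "-": 0.05}
    if context.get("content_word"):
        return {"FM": 0.05, "Fm": 0.15, "P": 0.6, "-": 0.2}
    return {"FM": 0.02, "Fm": 0.08, "P": 0.1, "-": 0.8}

def infer_prosodic_labels(context_sequence):
    """Per-syllable argmax of P(q_proso | q, lambda_symbolic): with independent
    leaf distributions, the sequence argmax decomposes over syllables."""
    labels = []
    for ctx in context_sequence:
        dist = leaf_distribution(ctx)
        labels.append(max(dist, key=dist.get))
    return labels

contexts = [{"content_word": True}, {"pre_pause": True}, {}]
print(infer_prosodic_labels(contexts))  # → ['P', 'FM', '-']
```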

CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier is used to merge the sequence of phonemes into a sequence of syllables. At the prosodic level, pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The role of the prosodic structure model is to infer a sequence of prosodic labels that are associated with relevant prosodic events.

sentence:  Longtemps , je me suis couché de bonne heure .

⇓ prosodic structure

FM:  * *
Fm:  * * *
P:   * * * *

syllable:  Long- temps ## je me suis cou- ché de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the one hand, expert approaches attempt to elaborate formal models that account for the observed prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic constraints; on the other hand, statistical methods attempt to elaborate a statistical model which accounts for the prosodic variations from the observation of statistical regularities on large speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure modelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Delais-Roussarie, 2000, Mertens, 2004b] for French; [Barbosa, 2006] for some other languages). Expert models assume that a prosodic structure results from the integration of various and potentially conflicting constraints, in particular syntactic and rhythmic [?, Dell, 1984] constraints. The linguistic module mostly concerns the extraction of prominent syntactic boundaries from deep syntactic parsing, based on syntactic constituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

[Figure: speech waveform (amplitude vs. time [s]) and f0 contour [Hz] (approx. 40–120 Hz), aligned with the phonetic transcription of « Longtemps, je me suis couché de bonne heure ».]

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

• journal: radio review (press review; political, economical, tech-nological chronicles); almost single speaker monologue with afew interactions with a lead journalist.

• sport commentary: soccer; two speakers engaged in mono-logues with speech overlapping during intense soccer sequencesand speech turn changes; almost no interactions.

The speech database consists of natural speech multi-media audiocontents with strongly variable audio quality (background noise:crowd, audience, recording noise, and reverberation). The speechprosody characteristics of the speech databased are illustrated in fig-ure 1.

3. SPEAKING STYLE MODEL

A speaking style model λ(style) is composed of discrete/continuouscontext-dependent HMMs that model the symbolic/acoustic speechcharacteristics of a speaking style.

λ(style) =“λ

(style)symbolic, λ

(style)acoustic

”(1)

During the training, the discrete/continuous context-dependentHMMs are estimated separately. During the synthesis, the sym-bolic/acoustic parameters are generated in cascade, from the sym-bolic representation to the acoustic variations. Additionally, a richlinguistic description of the text characteristics is automatically ex-tracted using a linguistic processing chain [9] and used to refine thecontext-dependent HMM modelling (see [10] and [11] for a detaileddescription of the enriched linguistic contexts).

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discreteHMM λ

(style)symbolic is estimated from the pooled speakers associated

with the speaking style.

The prosodic grammar consists of a hierarchical prosodic repre-sentation that was experimented as an alternative to TOBI [12]for French prosody labelling [13]. The prosodic grammar iscomposed of major prosodic boundaries (FM, a boundary whichis right bounded by a pause), minor prosodic boundaries (Fm, anintermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average modelλ

(style)symbolic is to be estimated. Let qproso = (q

(1)proso, . . . ,q

(R)proso)

the total set of prosodic symbolic observations, andq

(r)proso = [q

(r)proso(1), . . . , q

(r)proso(Nr)] is the prosodic sym-

bolic sequence associated with speaker r, where q(r)proso(n)

is the prosodic label associated with syllable n. Letq = (q(1), . . . ,q(R)) the total set of linguistic contextsobservations, and q(r) = [q(r)(1), . . . ,q(r)(Nr)] is the lin-guistic context sequence associated with speaker r, whereq(r)(n) = [q

(r)1 (n), . . . , q

(r)L (n)]� is the (Lx1) linguistic context

vector which describes the linguistic characteristics associated withsyllable n.

An average context-dependent discrete HMM λ(style)symbolic is estimated

from the pooled speakers observations. Firstly, an average context-dependent tree T(style)

symbolic is derived so as to minimize the infor-

mation entropy of the prosodic symbolic labels qproso condition-ally to the linguistic contexts q . Then, a context-dependent HMMmodel λ

(style)symbolic is estimated for each terminal node of the context-

dependent tree T(style)symbolic.

3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ(style)acoustic that

includes source/filter variations, f0 variations, and state-durations,is estimated from the pooled speakers associated with the speakingstyle based on the conventional HTS system ([14]).

Let R be the number of speakers from which an average model is tobe estimated. Let o = (o(1), . . . ,o(R)) the total set of observations,and o(r) = [o(r)(1), . . . ,o(r)(Tr)] is the observation sequencesassociated with speaker r, where o(r)(t) = [o

(r)t (1), . . . , o

(r)t (D)]�

is the (Dx1) observation vector which describes the acoustical prop-erty at time t. Let q = (q(1), . . . ,q(R)) the total set of linguisticcontexts observations, and q(r) = [q(r)(1), . . . ,q(r)(Tr)] isthe linguistic context sequence associated with speaker r, whereq(r)(t) = [q

(r)1 (t), . . . , q

(r)L (t)]� is the (L’x1) augmented linguistic

context vector which describes the linguistic properties at time t.

An average context-dependent HMM acoustic model λ(style)symbolic

is estimated from the pooled speakers observations. Firstly, acontext-dependent HMM model is estimated for each of thelinguistic contexts. Then, an average context-dependent treeT(style)

acoustic is derived so as to minimize the description length of thecontext-dependent HMM model λ

(style)acoustic.

The acoustic module models simultaneously source/filter variations,f0 variations, and the temporal symbolic associated with a speak-ing style. Speakers f0 were normalized with respect to the speakingstyle prior to modelling. Source, filter, and normalized f0 observa-tion vectors and their dynamic vectors are used to estimate context-dependent HMM models λ

(style)acoustic. Context-dependent HMMs are

clustered into acoustically similar models using decision-tree-basedcontext-clustering (ML-MDL [15]). Multi-Space probability Distri-butions (MSD) [16] are used to model continuous/discrete parame-ter f0 sequence to manage voiced/unvoiced regions properly. Eachcontext-dependent HMM is modelled with a state duration probabil-ity density functions (PDFs) to account for the temporal structure ofspeech [17]. Finally, speech dynamic is modelled according to thetrajectory model and the global variance (GV) that model local andglobal speech variations over time [?].

3.2. Generation of the Speech Parameters

During the synthesis, the text is first converted into a concatenatedsequence of context-dependent HMM models λ

(style)symbolic associated

with the linguistic context sequence q = [q1, . . . ,qN ], whereqn = [q1, . . . , qL]� denotes the (Lx1) linguistic context vectorassociated with linguistic unit n.

Firstly, the prosodic symbolic qproso is inferred so as to maximizethe log-likelihood of the prosodic symbolic sequence qproso condi-tionally to the linguistic context sequence q and the model λ(style)

symbolic.

qproso = argmaxqproso

P(qproso|q, λ(style)symbolic) (2)

• journal: radio review (press review; political, economical, tech-nological chronicles); almost single speaker monologue with afew interactions with a lead journalist.

• sport commentary: soccer; two speakers engaged in mono-logues with speech overlapping during intense soccer sequencesand speech turn changes; almost no interactions.

The speech database consists of natural speech multi-media audiocontents with strongly variable audio quality (background noise:crowd, audience, recording noise, and reverberation). The speechprosody characteristics of the speech databased are illustrated in fig-ure 1.

3. SPEAKING STYLE MODEL

A speaking style model λ(style) is composed of discrete/continuouscontext-dependent HMMs that model the symbolic/acoustic speechcharacteristics of a speaking style.

λ(style) =“λ

(style)symbolic, λ

(style)acoustic

”(1)

During the training, the discrete/continuous context-dependentHMMs are estimated separately. During the synthesis, the sym-bolic/acoustic parameters are generated in cascade, from the sym-bolic representation to the acoustic variations. Additionally, a richlinguistic description of the text characteristics is automatically ex-tracted using a linguistic processing chain [9] and used to refine thecontext-dependent HMM modelling (see [10] and [11] for a detaileddescription of the enriched linguistic contexts).

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discreteHMM λ

(style)symbolic is estimated from the pooled speakers associated

with the speaking style.

The prosodic grammar consists of a hierarchical prosodic repre-sentation that was experimented as an alternative to TOBI [12]for French prosody labelling [13]. The prosodic grammar iscomposed of major prosodic boundaries (FM, a boundary whichis right bounded by a pause), minor prosodic boundaries (Fm, anintermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average modelλ

(style)symbolic is to be estimated. Let qproso = (q

(1)proso, . . . ,q

(R)proso)

the total set of prosodic symbolic observations, andq

(r)proso = [q

(r)proso(1), . . . , q

(r)proso(Nr)] is the prosodic sym-

bolic sequence associated with speaker r, where q(r)proso(n)

is the prosodic label associated with syllable n. Letq = (q(1), . . . ,q(R)) the total set of linguistic contextsobservations, and q(r) = [q(r)(1), . . . ,q(r)(Nr)] is the lin-guistic context sequence associated with speaker r, whereq(r)(n) = [q

(r)1 (n), . . . , q

(r)L (n)]� is the (Lx1) linguistic context

vector which describes the linguistic characteristics associated withsyllable n.

An average context-dependent discrete HMM λ(style)symbolic is estimated

from the pooled speakers observations. Firstly, an average context-dependent tree T(style)

symbolic is derived so as to minimize the infor-

mation entropy of the prosodic symbolic labels qproso condition-ally to the linguistic contexts q . Then, a context-dependent HMMmodel λ

(style)symbolic is estimated for each terminal node of the context-

dependent tree T(style)symbolic.

3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ(style)acoustic that

includes source/filter variations, f0 variations, and state-durations,is estimated from the pooled speakers associated with the speakingstyle based on the conventional HTS system ([14]).

Let R be the number of speakers from which an average model is tobe estimated. Let o = (o(1), . . . ,o(R)) the total set of observations,and o(r) = [o(r)(1), . . . ,o(r)(Tr)] is the observation sequencesassociated with speaker r, where o(r)(t) = [o

(r)t (1), . . . , o

(r)t (D)]�

is the (Dx1) observation vector which describes the acoustical prop-erty at time t. Let q = (q(1), . . . ,q(R)) the total set of linguisticcontexts observations, and q(r) = [q(r)(1), . . . ,q(r)(Tr)] isthe linguistic context sequence associated with speaker r, whereq(r)(t) = [q

(r)1 (t), . . . , q

(r)L (t)]� is the (L’x1) augmented linguistic

context vector which describes the linguistic properties at time t.

An average context-dependent HMM acoustic model λ(style)symbolic

is estimated from the pooled speakers observations. Firstly, acontext-dependent HMM model is estimated for each of thelinguistic contexts. Then, an average context-dependent treeT(style)

acoustic is derived so as to minimize the description length of thecontext-dependent HMM model λ

(style)acoustic.

The acoustic module models simultaneously source/filter variations,f0 variations, and the temporal symbolic associated with a speak-ing style. Speakers f0 were normalized with respect to the speakingstyle prior to modelling. Source, filter, and normalized f0 observa-tion vectors and their dynamic vectors are used to estimate context-dependent HMM models λ

(style)acoustic. Context-dependent HMMs are

clustered into acoustically similar models using decision-tree-basedcontext-clustering (ML-MDL [15]). Multi-Space probability Distri-butions (MSD) [16] are used to model continuous/discrete parame-ter f0 sequence to manage voiced/unvoiced regions properly. Eachcontext-dependent HMM is modelled with a state duration probabil-ity density functions (PDFs) to account for the temporal structure ofspeech [17]. Finally, speech dynamic is modelled according to thetrajectory model and the global variance (GV) that model local andglobal speech variations over time [?].

3.2. Generation of the Speech Parameters

During the synthesis, the text is first converted into a concatenatedsequence of context-dependent HMM models λ

(style)symbolic associated

with the linguistic context sequence q = [q1, . . . ,qN ], whereqn = [q1, . . . , qL]� denotes the (Lx1) linguistic context vectorassociated with linguistic unit n.

Firstly, the prosodic symbolic qproso is inferred so as to maximizethe log-likelihood of the prosodic symbolic sequence qproso condi-tionally to the linguistic context sequence q and the model λ(style)

symbolic.

qproso = argmaxqproso

P(qproso|q, λ(style)symbolic) (2)

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-basedtime [s]

[Figure: two panels showing the waveform (amplitude) and the f0 contour (40–120 Hz) over time for the utterance "Longtemps, je me suis couché de bonne heure", annotated with its SAMPA phoneme sequence.]

CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART


• journal: radio review (press review; political, economical, technological chronicles); almost single-speaker monologue with a few interactions with a lead journalist.

• sport commentary: soccer; two speakers engaged in monologues, with speech overlapping during intense soccer sequences and speech turn changes; almost no interactions.

The speech database consists of natural multi-media audio contents with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech database are illustrated in figure 1.

3. SPEAKING STYLE MODEL

A speaking style model λ(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style.

λ(style) = (λ(style)_symbolic, λ(style)_acoustic)    (1)

During the training, the discrete/continuous context-dependent HMMs are estimated separately. During the synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [9] and used to refine the context-dependent HMM modelling (see [10] and [11] for a detailed description of the enriched linguistic contexts).
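The cascade can be sketched as follows. All interfaces here are hypothetical stand-ins (the paper does not expose an API): a linguistic chain produces context vectors, the symbolic model infers prosodic labels, and the acoustic model consumes the augmented contexts.

```python
# Sketch of the discrete/continuous cascade (hypothetical interfaces).

def synthesize(text, linguistic_chain, symbolic_model, acoustic_model):
    contexts = linguistic_chain(text)          # q = [q_1, ..., q_N]
    prosody = symbolic_model(contexts)         # q_proso, cf. eq. (2)
    # Augment each linguistic context with its inferred prosodic label.
    augmented = [ctx + [lab] for ctx, lab in zip(contexts, prosody)]
    return acoustic_model(augmented)           # acoustic parameters, cf. eq. (3)

# Toy stand-ins for the three modules:
chain = lambda text: [[w] for w in text.split()]
symbolic = lambda q: ["P" if len(ctx[0]) > 3 else "none" for ctx in q]
acoustic = lambda q: [(ctx[0], ctx[1]) for ctx in q]

params = synthesize("je me suis couché", chain, symbolic, acoustic)
```

The point of the sketch is the data flow: the acoustic stage never sees the raw text, only the linguistic contexts enriched with the symbolic prosodic decisions.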

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discrete HMM λ(style)_symbolic is estimated from the pooled speakers associated with the speaking style.

The prosodic grammar consists of a hierarchical prosodic representation that was experimented with as an alternative to ToBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average model λ(style)_symbolic is to be estimated. Let q_proso = (q(1)_proso, . . . , q(R)_proso) be the total set of prosodic symbolic observations, where q(r)_proso = [q(r)_proso(1), . . . , q(r)_proso(N_r)] is the prosodic symbolic sequence associated with speaker r, and q(r)_proso(n) is the prosodic label associated with syllable n. Let q = (q(1), . . . , q(R)) be the total set of linguistic context observations, where q(r) = [q(r)(1), . . . , q(r)(N_r)] is the linguistic context sequence associated with speaker r, and q(r)(n) = [q(r)_1(n), . . . , q(r)_L(n)]ᵀ is the (L×1) linguistic context vector which describes the linguistic characteristics associated with syllable n.

An average context-dependent discrete HMM λ(style)_symbolic is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels q_proso conditionally to the linguistic contexts q. Then, a context-dependent HMM model λ(style)_symbolic is estimated for each terminal node of the context-dependent tree T(style)_symbolic.

3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ(style)_acoustic that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].

Let R be the number of speakers from which an average model is to be estimated. Let o = (o(1), . . . , o(R)) be the total set of observations, where o(r) = [o(r)(1), . . . , o(r)(T_r)] is the observation sequence associated with speaker r, and o(r)(t) = [o(r)_t(1), . . . , o(r)_t(D)]ᵀ is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q(1), . . . , q(R)) be the total set of linguistic context observations, where q(r) = [q(r)(1), . . . , q(r)(T_r)] is the linguistic context sequence associated with speaker r, and q(r)(t) = [q(r)_1(t), . . . , q(r)_L'(t)]ᵀ is the (L'×1) augmented linguistic context vector which describes the linguistic properties at time t.

An average context-dependent HMM acoustic model λ(style)_acoustic is estimated from the pooled speaker observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree T(style)_acoustic is derived so as to minimize the description length of the context-dependent HMM model λ(style)_acoustic.

The acoustic module simultaneously models the source/filter variations, f0 variations, and the temporal structure associated with a speaking style. Speaker f0 values were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate the context-dependent HMM models λ(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state duration probability density functions (PDFs) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
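The "dynamic vectors" appended to the static observation vectors correspond to the rows of W in the constraint o = Wc used below. A minimal sketch, assuming a first-order central-difference delta with repeated edge frames (the actual window coefficients may differ):

```python
# Sketch of dynamic (delta) feature computation: static parameters c of
# shape (T, D) are augmented with delta(t) = (c[t+1] - c[t-1]) / 2,
# with the first/last frames repeated at the edges.
import numpy as np

def add_deltas(c):
    """c: (T, D) static parameters -> (T, 2D) [static | delta] matrix."""
    padded = np.vstack([c[:1], c, c[-1:]])     # repeat first/last frame
    delta = 0.5 * (padded[2:] - padded[:-2])   # central difference
    return np.hstack([c, delta])

obs = add_deltas(np.array([[0.0], [1.0], [2.0], [2.0]]))
```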

3.2. Generation of the Speech Parameters

During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models λ(style)_symbolic associated with the linguistic context sequence q = [q_1, . . . , q_N], where q_n = [q_1, . . . , q_L]ᵀ denotes the (L×1) linguistic context vector associated with linguistic unit n.

Firstly, the prosodic symbolic sequence q_proso is inferred so as to maximize the log-likelihood of the prosodic symbolic sequence q_proso conditionally to the linguistic context sequence q and the model λ(style)_symbolic:

q_proso = argmax_{q_proso} P(q_proso | q, λ(style)_symbolic)    (2)
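In the simplest case, eq. (2) reduces to a per-unit argmax: the context tree maps each linguistic context vector to a terminal node holding a label distribution, and the maximum-likelihood sequence is the per-node argmax. The sketch below assumes this factorized case (no transition model); all names are illustrative.

```python
# Sketch of eq. (2) in the factorized case: map each context vector to
# a terminal node of T_symbolic, then take the most probable label.

def infer_prosody(contexts, tree, node_dists):
    labels = []
    for q_n in contexts:
        node = tree(q_n)              # terminal node of the context tree
        dist = node_dists[node]       # P(label | node)
        labels.append(max(dist, key=dist.get))
    return labels

dists = {0: {"none": 0.8, "P": 0.2}, 1: {"P": 0.7, "Fm": 0.3}}
out = infer_prosody([{"stressed": 1}, {"stressed": 0}],
                    tree=lambda q: q["stressed"], node_dists=dists)
```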


Then, the linguistic context sequence q augmented with the inferred prosodic label sequence q_proso is converted into a concatenated sequence of context-dependent models λ(style)_acoustic.

The acoustic sequence o is inferred so as to maximize the log-likelihood of the acoustic sequence o conditionally to the model λ(style)_acoustic and the sequence length T.

o = argmax_o max_q P(o | q, λ(style)_acoustic, T) P(q | λ(style)_acoustic, T)    (3)

First, the state sequence q is determined so as to maximize the log-likelihood of the state sequence conditionally to the model λ(style)_acoustic and the sequence length T. Then, the observation sequence c is determined so as to maximize the log-likelihood of the observation sequence conditionally to the state sequence q and the model λ(style)_acoustic, under the dynamic constraint o = Wc:

R_q c = r_q    (4)

where:

R_q = Wᵀ Σ_q⁻¹ W    (5)

r_q = Wᵀ Σ_q⁻¹ μ_q    (6)

and Σ_q and μ_q are respectively the covariance matrix and the mean vector for the state sequence q.
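A worked toy instance of eqs. (4)-(6): for a 4-frame sequence with static and first-order delta rows in W, build R_q and r_q and solve for the smooth static trajectory c. The dimensions, window coefficients, and state statistics are illustrative, not those of the actual system.

```python
# Toy parameter generation under the constraint o = W c:
# R_q = W' Sigma_q^{-1} W (eq. 5), r_q = W' Sigma_q^{-1} mu_q (eq. 6),
# then solve R_q c = r_q (eq. 4).
import numpy as np

T = 4
W = np.zeros((2 * T, T))                 # o stacks [static; delta] rows
for t in range(T):
    W[t, t] = 1.0                        # static row: o_static(t) = c(t)
    W[T + t, min(t + 1, T - 1)] += 0.5   # delta(t) = (c[t+1] - c[t-1]) / 2
    W[T + t, max(t - 1, 0)] -= 0.5       # (edges clamped)

mu_q = np.array([1.0, 2.0, 2.0, 1.0] + [0.0] * T)   # state mean vector
Sigma_q_inv = np.eye(2 * T)                          # unit state variances

R_q = W.T @ Sigma_q_inv @ W              # eq. (5)
r_q = W.T @ Sigma_q_inv @ mu_q           # eq. (6)
c = np.linalg.solve(R_q, r_q)            # eq. (4)
```

With unit variances this amounts to smoothing the state means under the delta constraint; in the full system, Σ_q and μ_q come from the Gaussians of the selected state sequence, and R_q is banded, allowing an efficient Cholesky solution.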


4. EVALUATION

The proposed model has been evaluated in a speaking style identification perceptual experiment, and compared to a speaking style identification experiment with natural speech [18]. For the purpose of such a comparison, it was necessary to provide a single evaluation scheme for both experiments. In particular, it was not possible to control the linguistic content of natural speech utterances, which provides evident cues for DG identification (a single keyword would be sufficient to identify a DG). Thus, such a comparison required removing lexical access and focusing on the prosodic dimension only.

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure modelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Delais-Roussarie, 2000, Mertens, 2004b] for French; [Barbosa, 2006] for some other languages). Expert models assume that a prosodic structure results from the integration of various and potentially conflicting constraints, in particular syntactic and rhythmic [?, Dell, 1984] constraints. The linguistic module mostly concerns the extraction of prominent syntactic boundaries from deep syntactic parsing, based on syntactic constituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

• journal: radio review (press review; political, economic, and technological chronicles); almost single-speaker monologue with a few interactions with a lead journalist.

• sport commentary: soccer; two speakers engaged in monologues, with speech overlapping during intense soccer sequences and speech-turn changes; almost no interactions.

The speech database consists of natural multi-media speech audio contents with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech database are illustrated in figure 1.

3. SPEAKING STYLE MODEL

A speaking style model λ^(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style:

λ^(style) = ( λ^(style)_symbolic , λ^(style)_acoustic )    (1)

During the training, the discrete/continuous context-dependent HMMs are estimated separately. During the synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [9] and used to refine the context-dependent HMM modelling (see [10] and [11] for a detailed description of the enriched linguistic contexts).
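A minimal sketch of this cascade follows. The class names and toy decision rules are illustrative assumptions standing in for the trained discrete and continuous HMMs, not the paper's implementation; the point is only the data flow, where the acoustic model is conditioned on the generated symbolic labels.

```python
# Sketch of the cascaded generation: symbolic labels first, then acoustics.

class SymbolicModel:
    """Stand-in for the discrete context-dependent HMM (assumption)."""
    def decode(self, contexts):
        # toy rule: mark a prominence on every context flagged as stressed
        return ["P" if c.get("stressed") else "-" for c in contexts]

class AcousticModel:
    """Stand-in for the continuous context-dependent HMM (assumption)."""
    def generate(self, contexts):
        # toy rule: raise f0 on prominent syllables
        return [120.0 if c["prosody"] == "P" else 100.0 for c in contexts]

def synthesize(contexts, symbolic, acoustic):
    """Cascade: symbolic labels are generated first, then fed to the acoustic model."""
    labels = symbolic.decode(contexts)
    # augment the linguistic contexts with the generated prosodic labels
    augmented = [dict(c, prosody=l) for c, l in zip(contexts, labels)]
    return acoustic.generate(augmented)

contexts = [{"syllable": "bonne", "stressed": False},
            {"syllable": "heure", "stressed": True}]
print(synthesize(contexts, SymbolicModel(), AcousticModel()))  # [100.0, 120.0]
```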

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speakers associated with the speaking style.

The prosodic grammar consists of a hierarchical prosodic representation that was experimented with as an alternative to ToBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average model λ^(style)_symbolic is to be estimated. Let q_proso = (q^(1)_proso, …, q^(R)_proso) be the total set of prosodic symbolic observations, where q^(r)_proso = [q^(r)_proso(1), …, q^(r)_proso(N_r)] is the prosodic symbolic sequence associated with speaker r, and q^(r)_proso(n) is the prosodic label associated with syllable n. Let q = (q^(1), …, q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), …, q^(r)(N_r)] is the linguistic context sequence associated with speaker r, and q^(r)(n) = [q^(r)_1(n), …, q^(r)_L(n)]⊤ is the (L×1) linguistic context vector which describes the linguistic characteristics associated with syllable n.

An average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T^(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels q_proso conditionally to the linguistic contexts q. Then, a context-dependent HMM model λ^(style)_symbolic is estimated for each terminal node of the context-dependent tree T^(style)_symbolic.
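One split of such an entropy-minimizing tree can be sketched as follows. The candidate questions and the toy (context, label) data are illustrative assumptions; the tree-building procedure applies this selection recursively to each node.

```python
# Sketch of one split of the context-dependent tree: among candidate binary
# questions about the linguistic context, pick the one that minimizes the
# conditional entropy of the prosodic labels.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(data, question):
    """Weighted entropy of the labels after splitting on a yes/no question."""
    yes = [lab for ctx, lab in data if question(ctx)]
    no = [lab for ctx, lab in data if not question(ctx)]
    n = len(data)
    return sum(len(part) / n * entropy(part) for part in (yes, no) if part)

def best_question(data, questions):
    """Return the (name, question) pair with minimal conditional entropy."""
    return min(questions, key=lambda q: conditional_entropy(data, q[1]))

# toy (linguistic context, prosodic label) pairs
data = [({"pos": "NOUN", "final": True},  "FM"),
        ({"pos": "VERB", "final": True},  "FM"),
        ({"pos": "NOUN", "final": False}, "P"),
        ({"pos": "DET",  "final": False}, "-")]

questions = [("is-final", lambda c: c["final"]),
             ("is-noun",  lambda c: c["pos"] == "NOUN")]

print(best_question(data, questions)[0])  # is-final
```

Here "is-final" wins because it separates boundary labels from non-boundary labels perfectly (conditional entropy 0.5 vs. 1.0 for "is-noun").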

3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ^(style)_acoustic that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].

Let R be the number of speakers from which an average model is to be estimated. Let o = (o^(1), …, o^(R)) be the total set of observations, where o^(r) = [o^(r)(1), …, o^(r)(T_r)] is the observation sequence associated with speaker r, and o^(r)(t) = [o^(r)_t(1), …, o^(r)_t(D)]⊤ is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q^(1), …, q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), …, q^(r)(T_r)] is the linguistic context sequence associated with speaker r, and q^(r)(t) = [q^(r)_1(t), …, q^(r)_{L'}(t)]⊤ is the (L'×1) augmented linguistic context vector which describes the linguistic properties at time t.

An average context-dependent HMM acoustic model λ^(style)_acoustic is estimated from the pooled speaker observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree T^(style)_acoustic is derived so as to minimize the description length of the context-dependent HMM model λ^(style)_acoustic.

The acoustic module simultaneously models the source/filter variations, f0 variations, and temporal structure associated with a speaking style. Speakers' f0 were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate the context-dependent HMM models λ^(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence, so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with a state-duration probability density function (PDF) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
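The speaker f0 normalization step can be sketched as a z-score remapping in the log-f0 domain from each speaker's statistics onto the style's pooled statistics. The exact scheme used here is not specified above, so this particular remapping is an assumption:

```python
# Sketch of per-speaker f0 normalization toward the style average, in the
# log-f0 domain (a common z-score remapping, shown as an assumption).
from math import log, exp

def stats(values):
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v ** 0.5

def normalize_f0(speaker_f0, style_f0):
    """Map a speaker's voiced f0 values onto the style's log-f0 statistics."""
    log_spk = [log(f) for f in speaker_f0 if f > 0]  # skip unvoiced (f0 == 0)
    log_sty = [log(f) for f in style_f0 if f > 0]
    m_s, s_s = stats(log_spk)
    m_t, s_t = stats(log_sty)
    out = []
    for f in speaker_f0:
        if f <= 0:
            out.append(0.0)  # keep unvoiced frames as 0
        else:
            z = (log(f) - m_s) / s_s if s_s > 0 else 0.0
            out.append(exp(m_t + s_t * z))
    return out

# a high-pitched speaker remapped onto a lower style register
print(normalize_f0([180.0, 220.0, 0.0], [100.0, 140.0]))
```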

3.2. Generation of the Speech Parameters

During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models λ^(style)_symbolic associated with the linguistic context sequence q = [q_1, …, q_N], where q_n = [q_1, …, q_L]⊤ denotes the (L×1) linguistic context vector associated with linguistic unit n.

Firstly, the prosodic symbolic sequence q_proso is inferred so as to maximize the log-likelihood of the prosodic symbolic sequence conditionally to the linguistic context sequence q and the model λ^(style)_symbolic:

q̂_proso = argmax_{q_proso} P( q_proso | q, λ^(style)_symbolic )    (2)
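Under the additional assumption that the tree leaves provide per-syllable label distributions, the maximization in equation (2) factorizes over syllables and can be sketched as below; a sequence model would use Viterbi decoding instead. The `toy_leaf` distribution is illustrative.

```python
# Sketch of Eq. (2): infer the prosodic label sequence maximizing the
# log-likelihood given the linguistic contexts, assuming per-syllable
# label distributions at the tree leaves.
from math import log

LABELS = ("FM", "Fm", "P", "-")

def decode(contexts, leaf_distribution):
    """leaf_distribution(ctx) -> {label: probability} from the symbolic tree."""
    best_seq, loglik = [], 0.0
    for ctx in contexts:
        dist = leaf_distribution(ctx)
        label = max(LABELS, key=lambda l: dist.get(l, 0.0))
        best_seq.append(label)
        loglik += log(dist[label])
    return best_seq, loglik

# toy distribution: phrase-final syllables favour a major boundary
def toy_leaf(ctx):
    if ctx["final"]:
        return {"FM": 0.7, "Fm": 0.2, "P": 0.05, "-": 0.05}
    return {"FM": 0.05, "Fm": 0.15, "P": 0.3, "-": 0.5}

seq, ll = decode([{"final": False}, {"final": True}], toy_leaf)
print(seq)  # ['-', 'FM']
```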


[Figure: two panels showing waveform amplitude and f0 contour (Hz, roughly 40–120) over time, aligned with the phoneme sequence of « Longtemps, je me suis couché de bonne heure ».]

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

• journal: radio review (press review; political, economical, tech-nological chronicles); almost single speaker monologue with afew interactions with a lead journalist.

• sport commentary: soccer; two speakers engaged in mono-logues with speech overlapping during intense soccer sequencesand speech turn changes; almost no interactions.

The speech database consists of natural speech multi-media audiocontents with strongly variable audio quality (background noise:crowd, audience, recording noise, and reverberation). The speechprosody characteristics of the speech databased are illustrated in fig-ure 1.

3. SPEAKING STYLE MODEL

A speaking style model λ(style) is composed of discrete/continuouscontext-dependent HMMs that model the symbolic/acoustic speechcharacteristics of a speaking style.

λ(style) =“λ

(style)symbolic, λ

(style)acoustic

”(1)

During the training, the discrete/continuous context-dependentHMMs are estimated separately. During the synthesis, the sym-bolic/acoustic parameters are generated in cascade, from the sym-bolic representation to the acoustic variations. Additionally, a richlinguistic description of the text characteristics is automatically ex-tracted using a linguistic processing chain [9] and used to refine thecontext-dependent HMM modelling (see [10] and [11] for a detaileddescription of the enriched linguistic contexts).

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discreteHMM λ

(style)symbolic is estimated from the pooled speakers associated

with the speaking style.

The prosodic grammar consists of a hierarchical prosodic repre-sentation that was experimented as an alternative to TOBI [12]for French prosody labelling [13]. The prosodic grammar iscomposed of major prosodic boundaries (FM, a boundary whichis right bounded by a pause), minor prosodic boundaries (Fm, anintermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average modelλ

(style)symbolic is to be estimated. Let qproso = (q

(1)proso, . . . ,q

(R)proso)

the total set of prosodic symbolic observations, andq

(r)proso = [q

(r)proso(1), . . . , q

(r)proso(Nr)] is the prosodic sym-

bolic sequence associated with speaker r, where q(r)proso(n)

is the prosodic label associated with syllable n. Letq = (q(1), . . . ,q(R)) the total set of linguistic contextsobservations, and q(r) = [q(r)(1), . . . ,q(r)(Nr)] is the lin-guistic context sequence associated with speaker r, whereq(r)(n) = [q

(r)1 (n), . . . , q

(r)L (n)]� is the (Lx1) linguistic context

vector which describes the linguistic characteristics associated withsyllable n.

An average context-dependent discrete HMM λ(style)symbolic is estimated

from the pooled speakers observations. Firstly, an average context-dependent tree T(style)

symbolic is derived so as to minimize the infor-

mation entropy of the prosodic symbolic labels qproso condition-ally to the linguistic contexts q . Then, a context-dependent HMMmodel λ

(style)symbolic is estimated for each terminal node of the context-

dependent tree T(style)symbolic.

3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ(style)acoustic that

includes source/filter variations, f0 variations, and state-durations,is estimated from the pooled speakers associated with the speakingstyle based on the conventional HTS system ([14]).

Let R be the number of speakers from which an average model is tobe estimated. Let o = (o(1), . . . ,o(R)) the total set of observations,and o(r) = [o(r)(1), . . . ,o(r)(Tr)] is the observation sequencesassociated with speaker r, where o(r)(t) = [o

(r)t (1), . . . , o

(r)t (D)]�

is the (Dx1) observation vector which describes the acoustical prop-erty at time t. Let q = (q(1), . . . ,q(R)) the total set of linguisticcontexts observations, and q(r) = [q(r)(1), . . . ,q(r)(Tr)] isthe linguistic context sequence associated with speaker r, whereq(r)(t) = [q

(r)1 (t), . . . , q

(r)L (t)]� is the (L’x1) augmented linguistic

context vector which describes the linguistic properties at time t.

An average context-dependent HMM acoustic model λ(style)symbolic

is estimated from the pooled speakers observations. Firstly, acontext-dependent HMM model is estimated for each of thelinguistic contexts. Then, an average context-dependent treeT(style)

acoustic is derived so as to minimize the description length of thecontext-dependent HMM model λ

(style)acoustic.

The acoustic module models simultaneously source/filter variations,f0 variations, and the temporal symbolic associated with a speak-ing style. Speakers f0 were normalized with respect to the speakingstyle prior to modelling. Source, filter, and normalized f0 observa-tion vectors and their dynamic vectors are used to estimate context-dependent HMM models λ

(style)acoustic. Context-dependent HMMs are

clustered into acoustically similar models using decision-tree-basedcontext-clustering (ML-MDL [15]). Multi-Space probability Distri-butions (MSD) [16] are used to model continuous/discrete parame-ter f0 sequence to manage voiced/unvoiced regions properly. Eachcontext-dependent HMM is modelled with a state duration probabil-ity density functions (PDFs) to account for the temporal structure ofspeech [17]. Finally, speech dynamic is modelled according to thetrajectory model and the global variance (GV) that model local andglobal speech variations over time [?].

3.2. Generation of the Speech Parameters

During the synthesis, the text is first converted into a concatenatedsequence of context-dependent HMM models λ

(style)symbolic associated

with the linguistic context sequence q = [q1, . . . ,qN ], whereqn = [q1, . . . , qL]� denotes the (Lx1) linguistic context vectorassociated with linguistic unit n.

Firstly, the prosodic symbolic qproso is inferred so as to maximizethe log-likelihood of the prosodic symbolic sequence qproso condi-tionally to the linguistic context sequence q and the model λ(style)

symbolic.

qproso = argmaxqproso

P(qproso|q, λ(style)symbolic) (2)

• journal: radio review (press review; political, economical, tech-nological chronicles); almost single speaker monologue with afew interactions with a lead journalist.

• sport commentary: soccer; two speakers engaged in mono-logues with speech overlapping during intense soccer sequencesand speech turn changes; almost no interactions.

The speech database consists of natural speech multi-media audiocontents with strongly variable audio quality (background noise:crowd, audience, recording noise, and reverberation). The speechprosody characteristics of the speech databased are illustrated in fig-ure 1.

3. SPEAKING STYLE MODEL

A speaking style model λ(style) is composed of discrete/continuouscontext-dependent HMMs that model the symbolic/acoustic speechcharacteristics of a speaking style.

λ(style) =“λ

(style)symbolic, λ

(style)acoustic

”(1)

During the training, the discrete/continuous context-dependentHMMs are estimated separately. During the synthesis, the sym-bolic/acoustic parameters are generated in cascade, from the sym-bolic representation to the acoustic variations. Additionally, a richlinguistic description of the text characteristics is automatically ex-tracted using a linguistic processing chain [9] and used to refine thecontext-dependent HMM modelling (see [10] and [11] for a detaileddescription of the enriched linguistic contexts).

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discreteHMM λ

(style)symbolic is estimated from the pooled speakers associated

with the speaking style.

The prosodic grammar consists of a hierarchical prosodic repre-sentation that was experimented as an alternative to TOBI [12]for French prosody labelling [13]. The prosodic grammar iscomposed of major prosodic boundaries (FM, a boundary whichis right bounded by a pause), minor prosodic boundaries (Fm, anintermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average model λ^(style)_symbolic is to be estimated. Let q_proso = (q^(1)_proso, . . . , q^(R)_proso) be the total set of prosodic symbolic observations, where q^(r)_proso = [q^(r)_proso(1), . . . , q^(r)_proso(N_r)] is the prosodic symbolic sequence associated with speaker r, and q^(r)_proso(n) is the prosodic label associated with syllable n. Let q = (q^(1), . . . , q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), . . . , q^(r)(N_r)] is the linguistic context sequence associated with speaker r, and q^(r)(n) = [q^(r)_1(n), . . . , q^(r)_L(n)]^T is the (L×1) linguistic context vector which describes the linguistic characteristics associated with syllable n.

An average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T^(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels q_proso conditionally on the linguistic contexts q. Then, a context-dependent HMM model λ^(style)_symbolic is estimated for each terminal node of the context-dependent tree T^(style)_symbolic.
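The entropy criterion used to grow the symbolic tree can be illustrated on toy data. The split that is kept is the one minimizing the size-weighted entropy of the label partitions it induces; the context feature and question names below are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of prosodic labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(labels, contexts, question):
    """H(labels | yes/no answer to `question`): size-weighted entropy of the partitions."""
    yes = [l for l, c in zip(labels, contexts) if question(c)]
    no  = [l for l, c in zip(labels, contexts) if not question(c)]
    n = len(labels)
    return sum((len(p) / n) * entropy(p) for p in (yes, no) if p)

# Toy data: one prosodic label per syllable and a binary linguistic context feature.
labels   = ["P", "P", "Fm", "FM", "P", "FM"]
contexts = [{"phrase_final": f} for f in (0, 0, 0, 1, 0, 1)]

questions = {
    "phrase_final?": lambda c: c["phrase_final"] == 1,
    "always_yes?":   lambda c: True,   # uninformative question, for contrast
}
best = min(questions, key=lambda name: split_entropy(labels, contexts, questions[name]))
```

Here the informative question wins because it isolates the FM labels into a pure partition, driving the conditional entropy well below the unconditional entropy of the labels.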

3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ^(style)_acoustic that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].

Let R be the number of speakers from which an average model is to be estimated. Let o = (o^(1), . . . , o^(R)) be the total set of observations, where o^(r) = [o^(r)(1), . . . , o^(r)(T_r)] is the observation sequence associated with speaker r, and o^(r)(t) = [o^(r)_t(1), . . . , o^(r)_t(D)]^T is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q^(1), . . . , q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), . . . , q^(r)(T_r)] is the linguistic context sequence associated with speaker r, and q^(r)(t) = [q^(r)_1(t), . . . , q^(r)_L(t)]^T is the (L′×1) augmented linguistic context vector which describes the linguistic properties at time t.
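The observation vectors above are later augmented with dynamic (delta) features. As a minimal sketch, a first-order delta can be computed by central difference with edge frames replicated; HTS-style systems use configurable regression windows and usually delta-delta features as well, so this is only an illustration.

```python
def with_deltas(frames):
    """Append first-order dynamic (delta) features to each static observation
    vector, using the central difference delta_o(t) = (o(t+1) - o(t-1)) / 2,
    with edge frames replicated. A minimal sketch, not the HTS window scheme."""
    T = len(frames)
    out = []
    for t in range(T):
        prev = frames[max(t - 1, 0)]
        nxt  = frames[min(t + 1, T - 1)]
        delta = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        out.append(list(frames[t]) + delta)
    return out

augmented = with_deltas([[0.0], [2.0], [4.0]])
```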

An average context-dependent acoustic HMM model λ^(style)_acoustic is estimated from the pooled speaker observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree T^(style)_acoustic is derived so as to minimize the description length of the context-dependent HMM model λ^(style)_acoustic.

The acoustic module simultaneously models the source/filter variations, f0 variations, and temporal structure associated with a speaking style. Speakers' f0 was normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate the context-dependent HMM models λ^(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with a state-duration probability density function (PDF) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
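The per-style f0 normalization is not spelled out in the text; one plausible reading, sketched below under that assumption, is to shift each speaker's log-f0 so that its mean matches the style-wide mean log-f0, which removes speaker register while preserving the style's melodic excursions.

```python
import math

def normalize_f0(f0_by_speaker):
    """Shift each speaker's log-f0 so that its mean matches the style-wide
    mean log-f0. One plausible reading of the per-style normalization;
    the paper does not give the exact formula."""
    logs = {s: [math.log(v) for v in f0s] for s, f0s in f0_by_speaker.items()}
    pooled = [v for vs in logs.values() for v in vs]
    style_mean = sum(pooled) / len(pooled)
    normalized = {}
    for s, vs in logs.items():
        speaker_mean = sum(vs) / len(vs)
        normalized[s] = [math.exp(v - speaker_mean + style_mean) for v in vs]
    return normalized

norm = normalize_f0({"spk1": [100.0, 200.0], "spk2": [300.0, 300.0]})
```

After normalization every speaker's mean log-f0 coincides with the style mean, so the pooled training data is free of per-speaker register offsets.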

3.2. Generation of the Speech Parameters

During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models λ^(style)_symbolic associated with the linguistic context sequence q = [q_1, . . . , q_N], where q_n = [q_1, . . . , q_L]^T denotes the (L×1) linguistic context vector associated with linguistic unit n.

Firstly, the prosodic symbolic sequence q_proso is inferred so as to maximize the log-likelihood of the prosodic symbolic sequence q_proso conditionally on the linguistic context sequence q and the model λ^(style)_symbolic.

q_proso = argmax_{q_proso} P(q_proso | q, λ^(style)_symbolic)    (2)
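When the symbolic model factorizes over syllables (one tree-leaf distribution per context class), the global argmax of equation (2) reduces to independent per-syllable decisions. The sketch below assumes such a factorization; the context classes and probabilities are invented for illustration.

```python
# Hypothetical leaf distributions P(label | context class), as the
# entropy-minimizing context tree would yield (values are invented).
leaf_dists = {
    "function_word": {"none": 0.80, "P": 0.15, "Fm": 0.05},
    "content_word":  {"none": 0.30, "P": 0.50, "Fm": 0.20},
    "phrase_final":  {"none": 0.10, "Fm": 0.30, "FM": 0.60},
}

def decode_prosody(context_classes, leaf_dists):
    """Per-syllable argmax of P(label | context), valid whenever the symbolic
    model factorizes over syllables, so the global argmax reduces to
    independent local decisions."""
    return [max(leaf_dists[c], key=leaf_dists[c].get) for c in context_classes]

labels = decode_prosody(["function_word", "content_word", "phrase_final"], leaf_dists)
```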

CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier is used to merge the sequence of phonemes into a sequence of syllables. At the prosodic level, pauses are identified to
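The phonetizer/syllabifier front end can be sketched as a toy pipeline. The lexicon and the syllabification rule below are illustrative stand-ins, not the actual modules used in the chapter.

```python
# Toy front end: words -> phonemes -> syllables, on two French words
# in the X-SAMPA-like notation used in the figure.
LEXICON = {"bonne": ["b", "O", "n"], "heure": ["9", "R"]}
VOWELS = {"O", "9"}

def phonetize(words):
    """Look each word up in the lexicon and concatenate its phonemes."""
    return [p for w in words for p in LEXICON[w]]

def syllabify(phones):
    """Greedy grouping: a vowel closes each syllable nucleus; leftover
    trailing consonants are attached to the last syllable as a coda."""
    syllables, current = [], []
    for p in phones:
        current.append(p)
        if p in VOWELS:
            syllables.append(current)
            current = []
    if current and syllables:
        syllables[-1].extend(current)  # trailing coda
    elif current:
        syllables.append(current)
    return syllables

sylls = syllabify(phonetize(["bonne", "heure"]))
```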

4.6 Abstract Model: Text To Prosodic Structure

The role of the prosodic structure model is to infer a sequence of prosodic labels that are associated with relevant prosodic events.

sentence:  Longtemps , je me suis couché de bonne heure .

⇓ prosodic structure:
  FM:  * *
  Fm:  * * *
  P:   * * * *

syllable:  Long- temps ## je me suis cou- ché de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the one hand, expert approaches attempt to elaborate formal models that account for the observed prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic constraints. On the other hand, statistical methods attempt to elaborate a statistical model which accounts for the prosodic variations from the observation of statistical regularities on large speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure modelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Delais-Roussarie, 2000, Mertens, 2004b] for French; [Barbosa, 2006] for some other languages). Expert models assume that a prosodic structure results from the integration of various and potentially conflictual constraints, in particular syntactic and rhythmic [?, Dell, 1984] constraints. The linguistic module mostly concerns the extraction of prominent syntactic boundaries from deep syntactic parsing, based on syntactic constituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

[Figure: waveform (amplitude vs. time [s]) and f0 contour [Hz] for the utterance "Longtemps, je me suis couché de bonne heure" (phonemes: l o ta Z @m@ sH ik uS ed2b O n 9 R).]


The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based

52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:

STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to

4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.

sentence Longtemps , je me suis couche de bonne heure .

⇓prosodicstructureFM * *Fm * * *P * * * *

syllable Long- temps ## je me suis cou- che de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
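The text-to-prosodic-structure conversion illustrated in Table 4.2 can be mimicked with a toy rule set (entirely hypothetical; the actual labels come from the statistical model described later): FM marks a major boundary right-bounded by a pause (##), and P a prominent syllable.

```python
# Toy text-to-prosodic-structure conversion (illustrative rules only,
# not the statistical model described in the text).
# Output: one label per syllable: "FM" (major boundary, right-bounded
# by a pause), "P" (prominence), or "-" (unlabelled).

def label_syllables(syllables, pause_after, prominent):
    """Assign a prosodic label to each syllable."""
    labels = []
    for i, _ in enumerate(syllables):
        if pause_after[i]:
            labels.append("FM")   # pause => major boundary
        elif prominent[i]:
            labels.append("P")    # prominent syllable
        else:
            labels.append("-")
    return labels

syllables   = ["Long-", "temps", "je", "me", "suis", "cou-", "ché", "de", "bonne", "heure"]
pause_after = [False, True, False, False, False, False, False, False, False, True]
prominent   = [False, False, False, False, True, False, True, False, False, False]

print(label_syllables(syllables, pause_after, prominent))
```

The pause flags and prominence flags above are made up for the example sentence; in the thesis they come from the pause detection and prominence labelling steps.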

• journal: radio review (press review; political, economical, technological chronicles); almost single-speaker monologue with a few interactions with a lead journalist.

• sport commentary: soccer; two speakers engaged in monologues with speech overlapping during intense soccer sequences and speech turn changes; almost no interactions.

The speech database consists of natural multi-media audio contents with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech database are illustrated in figure 1.

3. SPEAKING STYLE MODEL

A speaking style model λ^(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style:

λ^(style) = ( λ_symbolic^(style) , λ_acoustic^(style) )        (1)

During the training, the discrete/continuous context-dependent HMMs are estimated separately. During the synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [9] and used to refine the context-dependent HMM modelling (see [10] and [11] for a detailed description of the enriched linguistic contexts).
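The cascade described around Eq. (1) can be sketched as two models applied in sequence. The two stub functions below are placeholders for the trained context-dependent HMMs, with made-up behaviour:

```python
# Sketch of the discrete/continuous cascade: the symbolic model maps
# linguistic contexts to prosodic labels, and the acoustic model maps
# the resulting labels to acoustic parameters. Both stubs stand in for
# the trained context-dependent HMMs.

def symbolic_model(contexts):
    # placeholder: label every 3rd unit as a minor boundary
    return ["Fm" if i % 3 == 2 else "-" for i in range(len(contexts))]

def acoustic_model(contexts, labels):
    # placeholder: boundary syllables get a longer duration (in ms)
    return [120 if lab == "Fm" else 80 for lab in labels]

def synthesize(contexts):
    """Cascade generation: symbolic first, then acoustic."""
    labels = symbolic_model(contexts)
    durations = acoustic_model(contexts, labels)
    return labels, durations

labels, durations = synthesize(["c%d" % i for i in range(6)])
print(labels)     # ['-', '-', 'Fm', '-', '-', 'Fm']
print(durations)  # [80, 80, 120, 80, 80, 120]
```

The point of the sketch is the data flow only: the acoustic stage consumes the symbolic stage's output, never the reverse.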

3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discrete HMM λ_symbolic^(style) is estimated from the pooled speakers associated with the speaking style.

The prosodic grammar consists of a hierarchical prosodic representation that was experimented with as an alternative to ToBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).

Let R be the number of speakers from which an average model λ_symbolic^(style) is to be estimated. Let q_proso = (q_proso^(1), ..., q_proso^(R)) denote the total set of prosodic symbolic observations, where q_proso^(r) = [q_proso^(r)(1), ..., q_proso^(r)(N_r)] is the prosodic symbolic sequence associated with speaker r, and q_proso^(r)(n) is the prosodic label associated with syllable n. Let q = (q^(1), ..., q^(R)) denote the total set of linguistic context observations, where q^(r) = [q^(r)(1), ..., q^(r)(N_r)] is the linguistic context sequence associated with speaker r, and q^(r)(n) = [q_1^(r)(n), ..., q_L^(r)(n)]ᵀ is the (L×1) linguistic context vector which describes the linguistic characteristics associated with syllable n.

An average context-dependent discrete HMM λ_symbolic^(style) is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T_symbolic^(style) is derived so as to minimize the information entropy of the prosodic symbolic labels q_proso conditionally on the linguistic contexts q. Then, a context-dependent HMM model λ_symbolic^(style) is estimated for each terminal node of the context-dependent tree T_symbolic^(style).
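The entropy criterion behind the tree derivation can be illustrated in miniature: among a set of binary context questions, select the one that minimizes the conditional entropy of the prosodic labels. This is only the split criterion, with made-up labels and questions, not the full decision-tree algorithm:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, answers):
    """H(labels | binary answer), weighted by branch size."""
    n = len(labels)
    h = 0.0
    for side in (True, False):
        branch = [l for l, a in zip(labels, answers) if a == side]
        if branch:
            h += (len(branch) / n) * entropy(branch)
    return h

def best_question(labels, questions):
    """Pick the question (name -> boolean answers) minimizing H(labels | q)."""
    return min(questions, key=lambda q: conditional_entropy(labels, questions[q]))

labels = ["FM", "FM", "-", "-", "P", "-"]
questions = {
    "sentence-final?": [True, True, False, False, False, False],
    "content-word?":   [True, False, True, False, True, False],
}
print(best_question(labels, questions))  # 'sentence-final?'
```

Here "sentence-final?" wins because it isolates the FM labels perfectly, driving one branch's entropy to zero.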

3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ_acoustic^(style) that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].

Let R be the number of speakers from which an average model is to be estimated. Let o = (o^(1), ..., o^(R)) denote the total set of observations, where o^(r) = [o^(r)(1), ..., o^(r)(T_r)] is the observation sequence associated with speaker r, and o^(r)(t) = [o_t^(r)(1), ..., o_t^(r)(D)]ᵀ is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q^(1), ..., q^(R)) denote the total set of linguistic context observations, where q^(r) = [q^(r)(1), ..., q^(r)(T_r)] is the linguistic context sequence associated with speaker r, and q^(r)(t) = [q_1^(r)(t), ..., q_L'^(r)(t)]ᵀ is the (L'×1) augmented linguistic context vector which describes the linguistic properties at time t.

An average context-dependent HMM acoustic model λ_acoustic^(style) is estimated from the pooled speaker observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree T_acoustic^(style) is derived so as to minimize the description length of the context-dependent HMM model λ_acoustic^(style).

The acoustic module simultaneously models the source/filter variations, the f0 variations, and the state durations associated with a speaking style. Speaker f0 was normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate the context-dependent HMM models λ_acoustic^(style). Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state duration probability density functions (PDFs) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
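The MSD treatment of f0 can be illustrated with a minimal sketch: the observation space is split into a zero-dimensional "unvoiced" space with a discrete weight and a continuous "voiced" Gaussian space. The weights and Gaussian parameters below are made up for illustration:

```python
import math

def msd_likelihood(x, w_voiced=0.8, mu=120.0, sigma=20.0):
    """Multi-space distribution over f0 observations (sketch).

    x is either the string "unvoiced" (zero-dimensional space) or an
    f0 value in Hz (one-dimensional Gaussian space). The space weight
    w_voiced and the Gaussian parameters are illustrative placeholders.
    """
    if x == "unvoiced":
        return 1.0 - w_voiced  # discrete mass of the unvoiced space
    # continuous density of the voiced space, scaled by its weight
    gauss = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return w_voiced * gauss

print(msd_likelihood("unvoiced") > 0)                 # unvoiced frames keep probability mass
print(msd_likelihood(120.0) > msd_likelihood(200.0))  # True: f0 near the mean is more likely
```

This is what lets a single stream score both voiced frames (a real f0 value) and unvoiced frames (no f0 at all) without interpolating fake f0 values through unvoiced regions.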

3.2. Generation of the Speech Parameters

During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models λ_symbolic^(style) associated with the linguistic context sequence q = [q_1, ..., q_N], where q_n = [q_1, ..., q_L]ᵀ denotes the (L×1) linguistic context vector associated with linguistic unit n.

Firstly, the prosodic symbolic sequence q_proso is inferred so as to maximize the log-likelihood of q_proso conditionally on the linguistic context sequence q and the model λ_symbolic^(style):

q̂_proso = argmax_{q_proso} P(q_proso | q, λ_symbolic^(style))        (2)
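A drastically simplified instance of Eq. (2): if the symbolic model is reduced to independent per-syllable label distributions (a unigram simplification, not the actual context-dependent HMM), the maximization decomposes into one argmax per unit. All probabilities below are made up:

```python
# Simplified version of Eq. (2): with independent context-dependent
# label distributions per syllable, the most likely prosodic label
# sequence is an independent argmax per syllable.

P = {  # hypothetical P(label | context), one row per context class
    "phrase-final":     {"FM": 0.70, "Fm": 0.20, "P": 0.05, "-": 0.05},
    "content-stressed": {"FM": 0.05, "Fm": 0.15, "P": 0.60, "-": 0.20},
    "function-word":    {"FM": 0.02, "Fm": 0.08, "P": 0.10, "-": 0.80},
}

def decode(contexts):
    """argmax over labels of P(label | context), per unit."""
    return [max(P[c], key=P[c].get) for c in contexts]

print(decode(["function-word", "content-stressed", "phrase-final"]))
# ['-', 'P', 'FM']
```

In the real model the labels are not independent, so the argmax runs over whole sequences (e.g. with Viterbi decoding) rather than per syllable.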



CHAPTER 4. PROSODY EXTRACTION

4.5 Prosodic Parameters Estimation

4.5.1 Fundamental Frequency (f0)

The fundamental frequency f0 and the periodicity are estimated using the STRAIGHT algorithm [Kawahara et al., 1999], a frequency-based fundamental frequency estimation method based on instantaneous frequency estimation and fixed-point analysis.

The analysis was performed using a 50 ms Blackman window and a 5 ms frame rate. The f0 boundaries set for the analysis were manually adapted depending on the characteristics of the speaker. The voiced/unvoiced regions were decided using the aperiodicity measure.

[Figure: estimation of f0 variations (in blue, superimposed on the spectrogram) for the utterance "Longtemps, je me suis couché de bonne heure." ("For a long time I used to go to bed early").]
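Given the 5 ms frame rate above, the number of analysis frames per utterance follows directly from its duration. A small worked example:

```python
def n_frames(duration_s, hop_s=0.005):
    """Number of analysis frames at a fixed hop (frame rate).

    One frame is placed at t = 0, then one every hop; round() guards
    against floating-point error in the division.
    """
    return round(duration_s / hop_s) + 1

# a 2-second utterance analysed at a 5 ms hop:
print(n_frames(2.0))  # 401
```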

4.5.2 Syllable duration


Figure 2: Generation of discrete/continuous speech parameters for the sentence: "Longtemps, je me suis couché de bonne heure" ("For a long time I used to go to bed early").

4. Evaluation

The proposed model has been evaluated in a speaking style identification perceptual experiment, and compared to a speaking style identification experiment with natural speech [15]. For the purpose of such a comparison, it was necessary to provide a single evaluation scheme for both experiments. In particular, it was not possible to control the linguistic content of natural speech utterances, which provides evident cues for DG identification (a single keyword can be sufficient to identify a DG). Thus, such a comparison required removing lexical access so as to focus on the prosodic dimension only.

4.1. Experimental Setup

40 speech utterances (10 per DG) were selected from the speaking style corpus and removed from the training set. Lexical access was removed using a band-pass filter whose pass-band was set so as to ensure that the range from the lowest fundamental frequency to the highest frequency of its first harmonic was included.
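Such a delexicalization filter can be approximated, for illustration, with a standard second-order (biquad) band-pass. The coefficients below follow the widely used audio-EQ biquad formulas; the sample rate, centre frequency, and Q are arbitrary placeholders, not the values used in the paper:

```python
import math

def bandpass_biquad(x, fs, fc, q=0.7):
    """Second-order band-pass (constant 0 dB peak gain biquad).

    x: input samples; fs: sample rate (Hz); fc: centre frequency (Hz).
    Returns the filtered samples (direct form I).
    """
    w0 = 2 * math.pi * fc / fs
    alpha = math.sin(w0) / (2 * q)
    # audio-EQ biquad band-pass coefficients, normalized by a0
    b0, b1, b2 = alpha, 0.0, -alpha
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    b0, b1, b2, a1, a2 = b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for s in x:
        out = b0 * s + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        y.append(out)
        x1, x2, y1, y2 = s, x1, out, y1
    return y

# sanity check: a band-pass rejects DC, so a constant input decays to ~0
dc = bandpass_biquad([1.0] * 2000, fs=16000, fc=150.0)
print(abs(dc[-1]) < 1e-3)  # True
```

A real delexicalization setup would pick the band edges per utterance from the measured f0 range, as described in the text.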

4.2. Subjective Evaluation

The evaluation consists of a multiple-choice identification task from speech prosody perception. The evaluation was conducted according to a crowd-sourcing technique using social networks. 50 subjects (including 25 native French speakers, 15 non-native French speakers, and 10 non-French speakers; 34 expert and 16 naïve listeners) participated in this experiment. Participants were given a brief description of the different speaking styles. Then, they were asked to associate a speaking style with each of the speech utterances. For this purpose, participants were given three options:

• total confidence: select only one speaking style when certain of the choice;

• confusion: select two different speaking styles when two speaking styles are possible;

• total indecision: select "indecision" when completely unsure. Subjects were asked to use this option only as a very last resort.

Additional information was gleaned from the participants: speech expertise (expert, naïve), language (native French speaker, non-native French speaker, non-French speaker), age, and listening condition (headphones or not). Expert participants actually came from various domains (speech and audio technologies, linguistics, music). Participants were encouraged to use headphones.

5. Results & Discussion

Identification performance was estimated using a measure based on Cohen's Kappa statistic [16]. Cohen's Kappa statistic measures the proportion of agreement between two raters with correction for random agreement. Our measure monitors the agreement between the ratings of the participants and the ground truth. The measure varies from -1 to 1: -1 is perfect disagreement; 0 is chance; 1 is perfect agreement. Confusion ratings were considered as equally possible ratings. Total indecision ratings were relatively rare (3% of the total ratings) and were removed. Figure 3 presents the identification confusion matrices.

The overall score reveals fair identification performance (κ = 0.38 ± 0.04), which is comparable to that observed for identification from natural speech (κ_natural = 0.45 ± 0.03). The identification performance significantly depends on the speaking style (figure 4): sport commentary is substantially identified (κ = 0.68 ± 0.05), journal fairly identified (κ = 0.50 ± 0.06), political discourse moderately identified (κ = 0.28 ± 0.07), and mass only slightly identified (κ = 0.12 ± 0.06). In comparison with identification from natural speech, the identification is comparable in the case of the sport commentary and the journal speaking styles (κ_natural = 0.70 ± 0.03 and κ_natural = 0.54 ± 0.05, respectively). However, there is a drop in identification for the political and the mass speaking styles, which is especially significant for the mass style (κ_natural = 0.34 ± 0.05 and κ_natural = 0.38 ± 0.04, respectively). This indicates that the model somehow failed to capture the relevant cues of the corresponding speaking style. Nevertheless, a large confusion


(a) natural speech

            MASS  POLITICAL  JOURNAL  SPORT
MASS         390        166       83      7
POLITICAL    237        357       64      2
JOURNAL       28        116      460     47
SPORT         43         38       73    470

(b) synthetic speech

            MASS  POLITICAL  JOURNAL  SPORT
MASS         165        154      131     53
POLITICAL    209        245       54      3
JOURNAL       18         87      348     28
SPORT         53         32       41    365

Figure 3: Identification confusion matrices. Rows represent the synthesized speaking style. Columns represent the identified speaking style.

exists between the political and the mass speech, which is inherent to a similarity in the speaking style and the formal situation in which the speech occurs. Additionally, the conventional HMM-based speech synthesis system failed to adequately model the breathiness and the creakiness that are specific to the political speaking style, especially within unvoiced segments.

An ANOVA analysis was conducted to assess whether the identification performance depends on the language of the participants. The analysis reveals a significant effect of the language (F(2, 59) = 15, p < 0.001) (F(48, 2) = 5.9, p = 0.005), and confirms the results obtained for natural speech. This provides evidence that variations of a speaking style exist depending on the language and/or cultural background.

Finally, an informal evaluation of the quality of the synthesized speech suggests that the speaking style modelling is robust to the large variety of audio quality.
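Cohen's kappa can be computed directly from a confusion matrix, as sketched below with the synthetic-speech counts of Figure 3. This is a plain kappa; the paper's variant additionally treats confusion ratings as equally possible, so it will not reproduce the reported κ = 0.38 exactly:

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square confusion matrix
    (rows: ground truth, columns: ratings)."""
    n = sum(sum(row) for row in matrix)
    # observed agreement: proportion of ratings on the diagonal
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / n
    # expected agreement under independent marginals
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# synthetic-speech confusion counts (rows/cols: MASS, POLITICAL, JOURNAL, SPORT)
synthetic = [
    [165, 154, 131, 53],
    [209, 245, 54, 3],
    [18, 87, 348, 28],
    [53, 32, 41, 365],
]
print(round(cohens_kappa(synthetic), 2))  # 0.42
```

The plain kappa lands near the paper's chance-corrected value; the residual gap comes from the handling of double (confusion) ratings.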

               MASS  POLITICAL  JOURNAL  SPORT
NATURAL         0.38       0.34     0.54   0.70
SYNTHESIZED     0.12       0.28     0.51   0.68

Figure 4: Mean identification scores (Cohen's Kappa) and 95% confidence intervals obtained for natural and synthesized speech.

6. Conclusion

In this study, the ability and the robustness of a HMM-based speech synthesis system to model the speech characteristics of various speaking styles were assessed. A discrete/continuous HMM was presented to model the symbolic and acoustic speech characteristics of a speaking style, and used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consisted of an identification experiment on 4 speaking styles based on delexicalized speech, compared with a similar experiment on natural speech. The evaluation showed that the discrete/continuous HMM consistently models the speech characteristics of a speaking style, and is robust to differences in audio quality. This provides evidence that the discrete/continuous HMM speech synthesis system successfully models the speech characteristics of a speaking style in the conditions of real-world applications.

7. References

[1] A.-C. Simon, A. Auchlin, M. Avanzi, and J.-P. Goldman, "Les phonostyles: une description prosodique des styles de parole en français," in Les voix des Français. Peter Lang, 2009.

[2] H. Schmid and M. Atterer, "New statistical methods for phrase break prediction," in International Conference on Computational Linguistics, Geneva, Switzerland, 2004, pp. 659–665.

[3] P. Bell, T. Burrows, and P. Taylor, "Adaptation of prosodic phrasing models," in Speech Prosody, Dresden, Germany, 2006.

[4] J. Yamagishi, T. Masuko, and T. Kobayashi, "HMM-based expressive speech synthesis – towards TTS with arbitrary speaking styles and emotions," in Special Workshop in Maui, Maui, Hawaii, 2004.

[5] S. Krstulovic, A. Hunecke, and M. Schröder, "An HMM-based speech synthesis system applied to German and its adaptation to a limited set of expressive football announcements," in Interspeech, 2007.

[6] E. Villemonte de La Clergerie, "From metagrammars to factorized TAG/TIG parsers," in International Workshop on Parsing Technology, Vancouver, Canada, Oct. 2005, pp. 190–191.

[7] N. Obin, P. Lanchantin, A. Lacheret, and X. Rodet, "Towards improved HMM-based speech synthesis using high-level syntactical features," in Speech Prosody, Chicago, U.S.A., 2010.

[8] A. Lacheret, N. Obin, and M. Avanzi, "Design and evaluation of shared prosodic annotation for spontaneous French speech: from expert knowledge to non-expert annotation," in Linguistic Annotation Workshop, Uppsala, Sweden, 2010.

[9] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg, "ToBI: a standard for labeling English prosody," in International Conference on Spoken Language Processing, Banff, Canada, 1992, pp. 867–870.

[10] N. Obin, V. Dellwo, A. Lacheret, and X. Rodet, "Expectations for discourse genre identification: a prosodic study," in Interspeech, Makuhari, Japan, 2010, pp. 3070–3073.

[11] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in European Conference on Speech Communication and Technology, Budapest, Hungary, 1999, pp. 2347–2350.

[12] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Hidden Markov models based on multi-space probability distribution for pitch pattern modeling," in International Conference on Acoustics, Speech, and Signal Processing, Phoenix, Arizona, 1999, pp. 229–232.

[13] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Hidden semi-Markov model based speech synthesis," in International Conference on Speech and Language Processing, Jeju Island, Korea, 2004, pp. 1397–1400.

[14] T. Toda and K. Tokuda, "A speech parameter generation algorithm considering global variance for HMM-based speech synthesis," IEICE Transactions on Information and Systems, vol. 90, no. 5, pp. 816–824, 2007.

[15] N. Obin, A. Lacheret, and X. Rodet, "HMM-based prosodic structure model using rich linguistic context," in Interspeech, Makuhari, Japan, 2010, pp. 1133–1136.

[16] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.