
Dept. for Speech, Music and Hearing

Quarterly Progress and Status Report

Synthesis of some Russian utterances by rules

Pauli, S. and Derkach, M.

journal: STL-QPSR
volume: 12
number: 1
year: 1971
pages: 043-049

http://www.speech.kth.se/qpsr

http://www.speech.kth.se


STL-QPSR 1/1971

II. SPEECH SYNTHESIS

A. SYNTHESIS OF SOME RUSSIAN UTTERANCES BY RULES

S. Pauli and M. Derkach*

The Purpose

The purpose of this work was to synthesize some intelligible Russian utterances with the OVE III synthesizer (1) using smoothed step commands and rules derived from a spectrographic study of symmetrical VCV-syllables pronounced by a single Russian speaker.

The OVE III Synthesizer

The control parameters are:

Branch      Formants      Anti-formants   Excitation   Quantal steps (dB)   Source
Vowel       F1 F2 F3 1)                                                     Voice, Noise
Nasal       FN 3)                         AN           8                    Voice
Fricative   K1 K2         AK 2)           AC           4                    Noise

1) F4 and F5 are kept constant.

2) The anti-formant frequency is given by AK in dB rel. K1 (+12 dB/oct) and cancels it when AK is 20 dB.

3) Single formant branch also used for non-nasal segments, e.g. the voice bar of voiced consonants.

The Synthesis Program (2)

The control parameter data are stored in the computer on a phoneme basis. Each data frame (phoneme), denoted by a one- or two-character code (e.g. S for [s], SH for [ʃ], etc.), contains a specification of value and onset time (= one data point) for each of the control parameters used. It also contains information on the total phoneme duration: parameter DR**.

One or more data points per control parameter can be specified in the phoneme. Any number of control parameters can be defined in a phoneme. Thus in many cases some parameters are not specified at all. The utterance to be synthesized is given by a string of phoneme codes (ASHA for [aʃa], etc.). Thus, for each control parameter a sequence of data points is assembled.

* Guest researcher from Lvov University, USSR, February 17 - December 15, 1970.

** One DR-unit is 20 msec long.

These are interpreted as step commands; i.e. a parameter value is adopted at the time it is specified and is then kept at this value until a new data point is given, at which moment it jumps to the new value. Data points from one phoneme override those for the following one if an overlap in the time direction occurs, see Fig. II-A-7a.
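
The assembly of data points and their interpretation as step commands can be sketched as follows. This is only an illustration under stated assumptions, not the original program: the frame contents, the formant and level values, and the names `FRAMES`, `assemble`, and `step_track` are hypothetical, and the override rule is read here as "a data point of a later phoneme is dropped unless it lies after the last data point already placed by an earlier phoneme".

```python
DR_MS = 20  # one DR-unit is 20 msec (from the text)

# Hypothetical frames: duration in DR-units and, per control parameter,
# data points as (onset relative to phoneme start in DR-units, value).
FRAMES = {
    "A":  {"dur": 8, "points": {"F1": [(-3, 700)], "A0": [(0, 24)]}},
    "SH": {"dur": 8, "points": {"F1": [(0, 300)], "A0": [(0, 0)]}},
}

def assemble(codes, frames=FRAMES):
    """Return {parameter: [(time, value), ...]} with times in DR-units."""
    tracks, start = {}, 0
    for code in codes:
        frame = frames[code]
        for param, points in frame["points"].items():
            seq = tracks.setdefault(param, [])
            # Assumed override rule: earlier phonemes win on overlap.
            limit = seq[-1][0] if seq else None
            for onset, value in points:
                t = start + onset
                if limit is None or t > limit:
                    seq.append((t, value))
        start += frame["dur"]
    return tracks

def step_track(points, n_units):
    """Sample-and-hold: the value held at each DR-unit 0..n_units-1."""
    out, current, i = [], None, 0
    for t in range(n_units):
        # Adopt every data point whose time has been reached.
        while i < len(points) and points[i][0] <= t:
            current = points[i][1]
            i += 1
        out.append(current)
    return out

tracks = assemble(["A", "SH", "A"])
a0_steps = step_track(tracks["A0"], 24)  # square wave for A0, one value per DR-unit
```

With these made-up frames, the F1 data point of the second vowel lands 3 DR-units before its nominal start, so the formant jump precedes the A0 jump.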

The set of square waves (one for each control parameter) is then smoothed in a digital lowpass filter prior to output to the synthesizer. The nominal time constants for the filters are:

Parameters:    Time constant (msec):

see the figures, especially Fig. II-A-7. Because of the particular read-out speed chosen in this study, the actual time constants were 25 % shorter.
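
The effect of smoothing a step command can be illustrated with a short sketch. The text only states that a digital lowpass filter with a given time constant is used; the first-order recursive form below, the 1 msec sampling interval, and the 20 msec time constant are assumptions chosen for illustration, not values taken from the paper.

```python
import math

def smooth(steps, tau_ms, dt_ms=1.0):
    """First-order recursive lowpass applied to a sampled step sequence.

    steps  : parameter values sampled every dt_ms (the square wave)
    tau_ms : time constant of the filter in msec
    """
    alpha = 1.0 - math.exp(-dt_ms / tau_ms)  # per-sample smoothing factor
    out, y = [], steps[0]                    # start settled at the first value
    for x in steps:
        y += alpha * (x - y)                 # move a fraction of the way to x
        out.append(y)
    return out

# A0 jumping from 24 dB to 0 dB after 100 msec, smoothed with a
# hypothetical nominal time constant of 20 msec.
nominal_tau_ms = 20.0
square = [24.0] * 100 + [0.0] * 100
smoothed = smooth(square, tau_ms=nominal_tau_ms)

# The text reports actual time constants 25 % shorter than nominal.
actual_tau_ms = 0.75 * nominal_tau_ms
```

In this sketch the smoothed value has fallen to 1/e of the step height one time constant after the jump, which is what limits how fast a parameter can follow a step command.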

The rules are of two kinds:

I. Rules inherent in the system as consequences of the 'smoothed square-wave philosophy' (2).

II. Rules relating different phoneme categories (with respect to acoustic and articulatory features) to control parameter values and timing.

The rest of this article describes the second kind of rules.

Phoneme Duration

The duration of all phonemes was set to 8 DR-units (160 msec). The prolongation phoneme, ":", adds 5 DR-units to the preceding vowel.
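
As a small worked example of the duration rule, the total length of an utterance follows directly from the number of phonemes and prolongation marks. The function names are hypothetical; the constants are the ones stated in the text.

```python
DR_MS = 20            # one DR-unit is 20 msec
PHONEME_UNITS = 8     # every phoneme lasts 8 DR-units
PROLONG_UNITS = 5     # ":" adds 5 DR-units to the preceding vowel

def utterance_units(codes):
    """Total duration in DR-units of a list of phoneme codes."""
    total = 0
    for code in codes:
        total += PROLONG_UNITS if code == ":" else PHONEME_UNITS
    return total

def utterance_ms(codes):
    """Total duration in msec."""
    return utterance_units(codes) * DR_MS
```

For instance, `utterance_ms(["A", "SH", "A"])` gives 480 msec, and a prolonged vowel `["A", ":"]` lasts 13 DR-units, i.e. 260 msec.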

Vowels

The vowels are [u], [o], [a], [e], [i], and [ɨ]. The formant frequencies are given in Table II-A-1. They are derived from a spectrographic study of 300 VCV-syllables, see ref. (3). The formant data points F1, F2, F3 are specified 3 DR-units before the nominal start of the phoneme (where A0

Fig. II-A-1. TIMING PATTERNS FOR EXCITATION PARAMETERS (smoothed step commands). Panels: [asa], [aza], [ata], [ada]. A0 is shown together with AC, AH, and AN, respectively. The manner of articulation determines the timing pattern for the excitation parameters. The different timing patterns possible are exemplified by the dentals in [aCa]-context. AH is always switched on at the same time as AC and is kept constant throughout the phoneme. AN is switched on at the beginning of all voiced consonants and is kept constant throughout the phoneme. The actual values are given in Table II-A-3.


Place of Articulation

The target values for the frequency control parameters have been derived from spectrograms of symmetrical VCV-syllables, where C is a voiced or voiceless fricative; [s], [z] targets have been used for the dentals, etc. See Fig. II-A-2.

Fig. II-A-2. First and second formant traces in [VzV]. The stationary part of the first vowel and the middle of the [z]-segment give the two points on each line. A mean of the [z]-points was used in the first approach as F1-F2 target value for the dental group of consonants. (Plot axes in kHz; F2 ticks from 1.2 to 2.2, the other axis from 0 to 0.9.)

Table II-A-3. AC, AN, and AH levels (dB). For A0 levels, see Fig. II-A-1.

AC:
Consonants    Strong                  Weak
              Fric.       Stops       Fric.       Stops
Voiceless     20 (s ʃ)    16 (t)      12 (f h)    8 (p k)
Voiced        16 (z ʒ)    12 (d)      8 (v)       4 (b g)

AN:
Voiceless     0
Voiced        8

AH:
Lab.    Dent.    Alv.    Vel.
12               20      28

Rules:

AC(fricative/stop) = +4 dB

AC(voiceless/voiced) = +4 dB

AC(strong/weak) = +8 dB
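
The three AC rules are additive, so the AC entries of Table II-A-3 can be regenerated from a single base level. The sketch below assumes the weak voiced stop (4 dB) as that base and treats [t] and [d] as members of the strong group, which is how the table values read; the function and dictionary names are hypothetical.

```python
BASE_AC_DB = 4  # weak voiced stop ([b], [g]) in Table II-A-3

def ac_level(fricative, voiceless, strong):
    """AC level in dB built from the three additive rules."""
    level = BASE_AC_DB
    if fricative:
        level += 4   # AC(fricative/stop) = +4 dB
    if voiceless:
        level += 4   # AC(voiceless/voiced) = +4 dB
    if strong:
        level += 8   # AC(strong/weak) = +8 dB
    return level

# (fricative, voiceless, strong) for a few consonants.
FEATURES = {
    "p": (False, True, False), "b": (False, False, False),
    "f": (True, True, False),  "v": (True, False, False),
    "s": (True, True, True),   "z": (True, False, True),
    "t": (False, True, True),  "d": (False, False, True),
}
LEVELS = {c: ac_level(*f) for c, f in FEATURES.items()}
```

Under these assumptions `LEVELS` reproduces the tabulated values, e.g. 20 dB for [s] and 8 dB for both [p] and [v].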


Manner of Articulation

The timing pattern for AC is the same within the fricatives; merely the levels differ (Fig. II-A-1a and II-A-1b). A different timing pattern is used within all stops except for [k] and [g], where the burst is longer; thus the AC-pattern starts 1 DR-unit earlier than in the other stops. See Fig. II-A-1c and II-A-1d.

AH is always switched on at the same time as AC and is kept constant throughout the remaining part of the phoneme (Fig. II-A-1).

There are two different timing patterns for A0: one for voiceless and one for voiced fricatives and stops (Fig. II-A-1).

AN, finally, is switched on at the beginning of all voiced fricatives and stops and is kept constant throughout the phoneme.

General Build-Up Procedure used for the Consonants

The voiceless fricatives [f], [s], [ʃ], and [h] were the first ones to be constructed. From these the voiced counterparts (with respect to place of articulation) [v], [z], and [ʒ] were made. [h] does not have a voiced counterpart in Russian.

The voiceless stops were constructed from the corresponding voiceless fricatives by switching off AC and AH in the first part of the phoneme.

The voiced stops, finally, are made from the voiceless stops in the same way as the voiced fricatives are made from the voiceless fricatives.

Lists of synthetic VCV-words (C = fricatives and stops) have been produced. After analysis of the results from several listening tests with native listeners, the formant and timing data have been changed to the values given in this text; i.e., an iterative procedure has been used to match our phoneme data to the rule system.

Detailed Description of Fricatives and Stops

Voiceless Fricatives (Vl. F.): A0 and AC are switched off and on, respectively, in two steps (24 to 12, 12 to 0 dB) at the beginning and at the end of the Vl. F. The time constants of the digital filters (see the section on the synthesis program) used for smoothing A0 and AC are too short for switching these parameters on and off in only one step (from 24 to 0 dB) in fricatives (see Fig. II-A-1a). The Vl. F. have been divided into two subgroups, strong [s], [ʃ] and weak [f], [h], with respect to the required AC-level. For the strong group it is 8 dB higher than for the weak, i.e. AC(strong/weak) = +8 dB. AH is kept constant during the phoneme. Fig. II-A-3 shows natural and synthetic [aCa]-words, C = Vl. F.

Voiced Fricatives (V. F.): These are constructed from the Vl. F., see Fig. II-A-1b. The voicing is accomplished by (a) switching A0 off later and on earlier in the V. F. than in the corresponding Vl. F. (compare Fig. II-A-1a and II-A-1b), and (b) switching on the nasal branch to provide a voice bar (AN = 8 dB, FN = 250 Hz). The tense/lax component of the V./Vl. distinction is realized by the rule AC(Vl.)/AC(V.) = +4 dB (Table II-A-3). Fig. II-A-4 shows natural and synthetic [aCa]-words, C = V. F.

Voiceless Stops (Vl. S.): [p], [t], and [k] are constructed from the corresponding Vl. F. [f], [s], and [h], respectively. The Vl. S. (Fig. II-A-1c) begin with a pause (6 DR-units long in [p] and [t] and 5 in [k]), after which the burst starts, i.e. AC and AH are switched on simultaneously. After 1 DR-unit AC is switched off but AH remains on. Fricatives are given a 4 dB higher AC-level than the corresponding stops, see Table II-A-3. Fig. II-A-5 shows natural and synthetic [aCa]-words, C = Vl. S.
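
The timing of a voiceless stop can be written down as a short list of data points. The sketch below covers only AC and AH (A0 is left out), represents "switched off" as 0 dB, and takes the levels as arguments; the function name and the tuple layout are hypothetical.

```python
PAUSE_UNITS = {"p": 6, "t": 6, "k": 5}  # silent interval in DR-units
BURST_UNITS = 1                          # AC stays on for 1 DR-unit

def stop_excitation(stop, ac_db, ah_db):
    """Data points (time in DR-units, parameter, value) for a voiceless stop.

    Times are relative to the start of the stop; levels are given in dB.
    """
    burst = PAUSE_UNITS[stop]
    return [
        (burst, "AC", ac_db),             # burst starts: AC on ...
        (burst, "AH", ah_db),             # ... together with AH
        (burst + BURST_UNITS, "AC", 0),   # AC off after 1 DR-unit, AH stays on
    ]
```

For example, `stop_excitation("k", 8, 28)` places the burst onset at 5 DR-units and the AC offset at 6, one unit earlier than for [p] and [t].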

Voiced Stops (V. S.): The general Vl. F./V. F. relations hold for Vl. S./V. S. also. Thus the voicing in V. S. was achieved by adding the "nasal" voice bar, with AN as in V. F. but FN = 200 Hz, see Fig. II-A-1d and Table II-A-3. Fig. II-A-6 shows natural and synthetic [aCa]-words, C = V. S.

Palatalization

The soft-hard relation in the synthesis has been based on a study by Derkach (3), and is achieved by inserting a palatalization command, i.e. a zero-duration phoneme with [i]-formants, before the consonant to be palatalized. These formant targets, which override those of the consonant, are reached in the consonant and kept there until the following vowel starts. The transitions will thus be delayed compared with a hard CV-segment, see Fig. II-A-7. Palatalization is denoted by "'" after the palatalized consonant. Spectrograms of natural and synthetic palatalized [aC'a]-words (C = dental consonant) are shown in Fig. II-A-8. No confusions about the hard/soft distinction in VCV-syllables were made in a test with 4 Russian listeners.
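
The notation (mark after the consonant) and the mechanism (command before the consonant) can be connected by a small tokenizer. This is a sketch under assumptions: the code inventory `CODES` and the command name `PAL` are hypothetical, and longest-match splitting of one- and two-character codes is one plausible reading of the code scheme, not a description of the original program.

```python
CODES = {"A", "S", "SH", "Z", "T", "D"}  # hypothetical code inventory
PAL = "PAL"                               # hypothetical name of the command

def tokenize(text, codes=CODES):
    """Split a code string into phoneme codes, longest match first."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "'":
            if not out:
                raise ValueError("palatalization mark without a consonant")
            out.insert(len(out) - 1, PAL)  # command goes before the consonant
            i += 1
            continue
        for size in (2, 1):                # codes have one or two characters
            chunk = text[i:i + size]
            if len(chunk) == size and chunk in codes:
                out.append(chunk)
                i += size
                break
        else:
            raise ValueError(f"unknown code at position {i}: {text[i:]!r}")
    return out
```

With this inventory, `tokenize("AS'A")` yields `["A", "PAL", "S", "A"]`, so the zero-duration command precedes the consonant whose formant targets it overrides.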

Fig. II-A-3. Spectrograms of natural and synthetic [aCa]-words with the voiceless fricatives [f], [s], [ʃ], and [h]. (Panels: natural and synthetic; frequency axis in kc.)

Fig. II-A-4. Spectrograms of natural and synthetic [aCa]-words with the voiced fricatives [v], [z], and [ʒ]. (Panels: natural and synthetic; frequency axis in kc.)

Fig. II-A-6. Spectrograms of natural and synthetic [aCa]-words with the voiced stops [b], [d], and [g]. (Panels: natural and synthetic; frequency axis in kc.)

Fig. II-A-7. HARD-SOFT DISTINCTION. Note the [i]-like formant frequency positions in the consonant and the delayed transitions in the soft segment compared to the hard one. The palatalization command is inserted before the consonant. The command has two data points for each of F1, F2, and F3. The two arrows in Fig. II-A-7a point at the two data points of F1. For clarity, the F1 transitions of Fig. II-A-7b have been drawn as dashed lines in Fig. II-A-7a.

Fig. II-A-8. Spectrograms of natural and synthetic palatalized [aC'a]-words with the dental consonants [s], [z], [t], and [d].

Fig. II-A-9. Spectrogram of the sentence [d'e:vuʃka ka:k t'eb'a: zavu:t] (Young girl, what is your name?).


Results

13 sentences have been synthesized. One of them is "Young girl, what is your name?", in transcription [d'e:vuʃka ka:k t'eb'a: zavu:t]. A spectrogram of this sentence is shown in Fig. II-A-9.

Conclusions

The program philosophy and the rules described make it possible to construct very compact phoneme data frames. [k] and [g] can be improved by introducing back and front variants. The palatalization may sound exaggerated to non-Ukrainian ears. This can be improved by a change in the timing of the palatalization command. At present the system design imposes restrictions on the rate of formant transitions through a preset time constant for each control parameter. This is one of the limitations which may be removed in future developments if the quality is to be improved.

Acknowledgments

We wish to thank Johan Liljencrants, the originator of the square-wave synthesis technique, whose ideas and programming made this work possible. Many valuable discussions with Gunnar Fant gave us new perspectives on many problems. We are also indebted to Si Felicetti for her editorial help, to Gunnel Karlsson who made the drawings and mountings, and to other friends and colleagues in this laboratory for their cooperation.

References:

(1) J. Liljencrants: "The OVE III Speech Synthesizer", IEEE Trans. AU-16, No. 1 (1968), pp. 137-140; also in STL-QPSR 2-3/1967, pp. 76-81.

(2) J. Liljencrants: "Speech Synthesizer Control by Smoothed Step Functions", STL-QPSR 4/1969, pp. 43-50.

(3) M. Derkach: "Phonemic Coarticulation in Russian Hard and Soft VCV-Utterances with Voiceless Fricatives", STL-QPSR 2-3/1970, pp. 1-7.
