synthesis of some russian utterances by rules 1/1971 11. speech synthesis a. synthesis of some...
TRANSCRIPT
Dept. for Speech, Music and Hearing
Quarterly Progress andStatus Report
Synthesis of some Russianutterances by rules
Pauli, S. and Derkach, M.
journal: STL-QPSRvolume: 12number: 1year: 1971pages: 043-049
http://www.speech.kth.se/qpsr
STL-QPSR 1/1971
11. SPEECH SYNTHESIS
A. SYNTHESIS O F SOME RUSSIAN UTTERANCES B Y RULES
S. Paul i and h4. Derkach-:+
The Purpose
The purpose of this work was to synthesize some intelligible Russian
ut terances with the OVE 111 synthesizer(') using smoothed step commands
and ru les derived f rom a spectrographic study of symmetr ical VCV-syl-
lables pronounced by a single Russian speaker. I
The OVE 111 Synthesizer
The control pa ramete r s a re :
Branch: Formants : Anti-Formants: Excitation: Quanta1 ~ t e p s ( d ~ ) : Source :
Vowel F 1 F 2 F3 1) Voice Noise
Nasal FN 3, AN 8 Voice
Fricat ive K 1 K 2 AI< 2, AC 4 Noise
1) F4 and F5 a r e kept constant.
2 ) The anti-formant frequency is given by AK in dB rel . K i ( + i 2 d ~ / o c t ) and cancels i t when AK i s 20 dB.
3) Single formant branch a lso used for non-nasal segments, e. g. the voice bar of voiced consonants.
The Synthesis P rogram ( 2 1
The control parameter data a r e s t ~ r e d in the computer on a phoneme
basis. Each data f r ame (phoneme), denoted by a one- o r two-character code
(e. g. S for [ s J , SH fo r [I] etc. ), contains specification of value and onset
time (=one data point) for each of the control pa ramete r s used. I t a lso con-
tains information of the total phoneme duration; parameter DRW.
One o r m o r e data points/control parameter in the phoneme can be spec-
ified. Any number of control pa ramete r s can be defined in a phoneme.
Thus in many cases some pa ramete r s a r e not specified a t all. The utterance 1
$6 Guest r e sea rche r f rom Lvov University, USSR, February 17 - December 15, 1970.
+*;* One DR-unit i s 20 m s e c long.
STL-QPSR 1/197i 44.
to be synthesized is given by a s t r ing of phoneme-codes (ASHA for [afa],
etc. ). Thus, fo r each control parameter a sequence of data points is a s -
sembled.
These a r e interpreted a s step commands; i. e. a parameter value is
adopted a t the t ime it i s specified and then i t is kept a t this value until a
new data point i s given and a t that very moment i t jumps to the new value.
Data points f r o m one phoneme overr ide those for the following one, if an
overlap in the t ime direction occurs , see Fig. 11-A-7a.
The se t of square -waves (one sq. w. fo r each control parameter ) i s then
smoothed in a digital lowpass fi l ter p r io r to output to the synthesizer. The
nominal time constant for the f i l te rs a re :
Pa ramete r s : Time constant (msec):
see the f igures , especially Fig. 11-A-7. Because of the particular read-
out speed chosen in this study the actual t ime constants were 25 O/o shorter .
The ru les a r e of two kinds:
I, Rules inherent in the syste due to the consequences of the 'smoothed square -wave philosophy' . (Pf
11. Rules re la t ing different phoneme categories (with respect to acoustic and ar t iculatory features) and control parameter values and timing. I
The r e s t of this ar t ic le will descr ibe the second kind of rules.
Phoneme Duration
The duration f o r a11 phonemes was chosen to 8 DR-units (160 msec) .
The prolongation phoneme, ":", adds 5 DR-units to the preceding vowel.
Vowels
The vowels a r e [u], [o], [a], [ e l , [i], and [ i ] . The formant frequen-
cies a r e given in Table 11-A-I. They a r e derived f r o m a spectrograpllic
study of 300 VCV-syllables, see ref. ( 3 ) . The formant data points Fi F Z F 3
a r e specified 3 DR-units before the nominal s t a r t of the phoneme (where A0
AO, A H
AO, A N n n 1:)d-i i\ [-)d-l 8 0
a s a a z a a t a a d a
F i g . 11-A- I . TIMING PATTERNS FOR EXCITATION PARAMETERS. (smoothed s t e p c o m m a n d s ) A0 is shown together with AC, AH, and AN r e s p . The m a n n e r of a r t i cu la t ion d e t e r - m i n e s the t iming p a t t e r n f o r the excitat ion p a r a m e t e r s. The d i f ferent t iming p a t t e r n s poss ib le a r e exempl i f ied by the den ta l s in CaCal -context . AH i s a lways swi tched or1 a t the s a m e t ime a s AC and i s kept constant throughout the phoneme. AN is swi tched
- on a t the beginning of a l l voiced consonants and is kept constant throughout the phon- e m e . The ac tual va lues a r e given in Table 11-A-3.
STL-QPSR 1 / 1 9 7 1
Table 11-A-3. AC, AN, and AH leve l s ( d ~ ) . F o r A0 l e v e l s , s ee Fig. 11-A- I.
P l a c e of ar t icula t ion
The t a r ge t values for the f requency co n t r o l - pa r ame te r s have been i der ived f r o m s p e c t r o g r a m s o i symmet r i c a l VCV-syllables, whe re C
I I
i s voiced and vo ice less f r i ca t ives ; [s] , [ z ] t a r g e t s have been u sed for
the denta ls etc. See Fig . 11-A-2. I
+
Rules:
~ ~ ( f r icative/sto@= = t 4 dB
~ ~ ( v o i c e l e s s/voiced) = = t 4 dB
AC (s t rong/weak)= = t 8 d B
I
Conso-
can t s :
Voiceless
Voiced
k H t ~ F2
2,2 - 290 - 1.8 - 1.6 - 194 - 192
be I 0 ,8 6 , l o l 1 1 I ) ,
0 .I .2 .3 4 -5 -6 .7 .8 .9 kHz
b
AN
0
8
Fig. 11-A-2. F i r s t and second fo rman t t r a c e s in [VzV]. The s ta t ionary p a r t of the f i r s t vowel and the middle of the [ z ] - segment give the two points on each line. A mean of the [ z ] - points was used in the f i r s t approach a s F I - F 2 t a rge t va lue for the dental group of consonants.
AH
1,ab. 12
Dent.
Alv. 20
Vel. 28
AC
Strong Weak
Stops
8
p k
4
b g
F r i c .
12
f 1 1
8
v
F r i c .
2 0
s . J
16
" 3
Stops
16
12
d
STL-QPSR 1/197 1
Manner of Articulation
The timing pattern for AC i s the same within the fricatives, merely the
levels differ; Fig. 11-A- l a and 11-A- 15. A different timing pattern i s used
within al l stops except for and [g], where the burst i s longer; thus the
AC-pattern s tar ts 1 DR-unit ear l ier than in the other stops. See Fig. 11-A- i c
and 11-A- id.
AH i s always switched on a t the same time a s AC and i s kept constant
throughout the remaining par t of the phoneme, (Fig. 11-A- I).
There a r e two different timing patterns for AO: one for voiceless and
one for voiced fricatives and stops, (Fig. 11-A- I).
AN, finally, i s switched on a t the beginning of all voiced fricatives and
stops and i s kept constant throughout the phoneme.
General Build-Up Procedure used for the Consonants
The voiceless fricatives [f], Cs], [I], and [h] were the f i r s t ones to
be constructed. F rom these the voiced counterparts (with respect to place
of articulation): [v] , [zl, and c 3 ] were made. [h] does not have a voiced
counterpart in Rus sian.
The voiceless stops were constructed from the corresponding voiceless
fricatives by switching off AC and AH in the f i r s t par t of the phoneme.
The voiced stops, finally, a r e made from the voiceless stops in the
same way a s the voiced fricatives a r e made from the voiceless fricatives.
Lists with synthetic VCV-words (C=fricative s and stops) have been pro - duced. After analysis of the resul ts from several listening tests with native
l is teners the formant and timing data have been changed to the values given
in this text; i. e . , an iterative procedure has been used to match our pho-
neme-data to the rule system.
Detailed Description of Fricatives and Stops
Voiceless Fricatives (Vl. F.J: A0 and AC a r e switched off r e sp. on in --------------- two stops (24 to 12, 12 to 0 d ~ ) a t the beginning and a t the end of the Vl. F. - The time constants of the digital f i l ters (see p. 44) used for smoothing A0
and AC i s too short for switching these parameters on and off in only one - step (from 24 to 0 dB) infr icat ives, (see Fig. 11-A-la). The V1.F. have
been divided into two subgroups: strong [s l , [f] and weak [f j, [hj with
STL-QPSR 1/1971 48.
respect to the required AC-level. F o r the strong group i t i s 8 dB higher
than that for the weak, i. e. : AC (strong/weak) = +8 dB. AH is kept constant
during the phoneme. Fig. 11-A- 3 shows natural and synthetic [aCa] -words,
C=Vl. F.
Voiced Fr ica t ives LV. F.): These a r e constructed f rom the V1. I?. , see ---------- -- Fig. 11-A-lb. The voicing i s accomplished by (a) Switching A0 off la te r
and on ea r l i e r in the V. F. than in the corresponding V1. F. Compare Fig.
11-A-la and 11-A- ib. (b) Switching on the nasal branch to provide a voice
bar (AN = 8 dB, FN = 250 HZ). The tense/lax component of the V. / ~ 1 .
distinction i s realized by the rule: AC(V1. )/Ac(v. ) = t 4 dB (Table 11-A-3).
Fig. 11-A-4 shows natural and synthetic [aCa] -words, C=V. F.
Voiceless Stops CV1. s.1: [p], [t ], and [k ] a r e constructed f rom the ------- - -- corresponding V1. F. [f I , [s], and [h] r e sp. The Vl. S. (Fig. 11-A- l c )
begin with a pause (6 DR-units long in [p] and [t] and 5 in [k] ) af te r which
the burst s t a r t s , i. e. AC and AH a r e switched on simultaneously. After
i DR-unit AC is switched off but AH remains on. Fr ica t ives a r e given 4 dB
higher AC-level than the corresponding stops, see Table II-A-3. Fig. 11-A-5
shows natural and synthetic [aCa] -words, C=Vl. S.
Voiced s tops 1V. s.1: The general V1. F. /v. F. relations hold fo r ----- - -- Vl.S./v.S. also. Thus the voicing in V.S. was achieved by adding the
"nasal" voice bar . AN a s in V. I?. but F N = 200 Hz, see Fig. 11-A- id and
Table 11-A-3. Fig. 11-A-6 shows natural and synthetic [aca lwords , C=V. S .
Palatalization
The soft-hard relation in the synthesis has been based on a study by
D e r k a ~ h ( ~ ) , and i s achieved by inserting a palatalization command, i. e. a
zero-duration phoneme with [i] -formants before the consonant to be palata-
lized. These formant- targets which override those of the consonant, a r e
reached in the consonant and kept there until the following vowel s ta r t s .
The transit ions will thus be delayed compared with a hard CV-segment, see
Fig. 11-A-7. Palatalization i s denoted by "' " af te r the palatalized consonant.
Spectrograms of natural and synthetic palatalized [aC' a ] -words (C=dental
consonant) a r e shown in Fig. 11-A-8. No confusions about the hard/soft
distinction in VCV-syllables were made in a tes t with 4 Russian l is teners .
NATURAL -kc,
7- - 6 i - 5 4
SYNTHETIC kc- 7- - 6- ' - 5- -
kc, . ?-
asa 3 f -
I I
k c - ' . .
1-; - 6-
I
kc, 3- - 6- - 5- -
kc,
kc, 7- - 6- - I
Fig. 11-A-3. Spectrograms of natural and synthetic [aCa] -words with the voiceless fricatives [f], [s]. , and [h].
kc, NATURAL kc - SYNTHETIC 7- --- - -- -. -- -- -- . - . 7- -
--- - - - - -- - - 6- 6- -
- 5- I - - 4- 4 -. -
1 i 1111" . Q
3- - -, - 2- - 1- -
4 1
o ~ l ~ t ~ I ~ l ~ ~ I ~ ' ~ - l -
0 .1 .2 .3 .L .S .6 .7
k c , , ,
7*
11
' I '
kc -' -- -..- A
k c - . - - - . 7-1 : 7-
Fig. 11-A-4. Spectrograms of natural and synthetic [aCa] - words with the voiced fricatives [v], [I, and [j 1.
aba
ada
kc - NATURAL 7- - 6- - 5 l
kc- 7- - 6- - 5- -
kc, SYNTHETIC
kc - 7- - 6- - 5- - 1 L- - $ 4 ( 1 I I Ill
k c , 7-! 1
6- 4 5" *
4-1
' I ' -7
Fig. II-A-6. Spectrograms of natural and synthetic [aCa]- words with the voiced stops [b], [dl, and [g].
Fig. 11-A-7. HARD -SOFT DISTINCTION. Note the [ i ]-like formant frequency positions in the consonant and the delayed transit ions in the soft segment compared to the hard one. The palatalization command is inserted before the consonant. The command has two data-points for each of F l , F 2 , and F3. The two a r rows in Fig. 11-A-7a a r e pointing a t the two data-points of Fl. F o r clarity the Fl transit ions in Fig. 11-A-7b has been drawn in dashed lines in Fig. 11-A-7a.
NATURAL
as'a
kc, 7- - 6- - 8- - 4- - 3- - 2- - 1-
4 0
Fig. 11-A-8. Spec t rog rams of n a tu r a l and synthetic pala ta l ized [aC' a> words with the dental consonants is], [z], [ t 1, and [dl.
d t e : v u f k a k a: k t ' e b ' a : z a v u: t -
Fig. 11-A-9. Spectrogram of the sentence [d' e:vujka ka:k t' eb' a: zavu:t] (Young girl, what i s your name ? )
STL-QPSR 1/1971
Results
13 sentences have been synthesized. One of them i s "Young gir l what i s
your name ? I t . In t ranscript ion [dJe:vd,fka 1ia:k tbb'a: zavu:t]. Spectrogram
of this sentence i s shown in Fig. II-A-9.
Conclusions
The program philoazphy and the rules described makes i t possible to
construct very compact phoneme data f rames . PI and [g] can be improved
by introducing back and front variants. The palatalization may sound exag-
gerated to non-Ukranian ears . This can be improved by a change in the
timing of the palatalization command. At present the system design imposes
restr ic t ions on the rate of formant transitions by a p rese t time constant for
each control parameter . This is one of the limitations which may be removed
in future developments if the quality shall be improved.
Acknowledgments
We wish to thank Johan Liljencrants, the originator of the square-wave
synthesis technique, whose ideas and programming made this work pos sible.
Many valuable discussions with Gunnar Fant gave u s new aspects on many
problems. We a r e also indebted to Si Felicetti for her editorial help, to I
Gunnel Karlsson who made the drawings and mountings, and to other friends l
and colleagues in this laboratory for their cooperation. \
References: 1 ( I ) J. Liljencrants: "The OVE I11 Speech Synthesizertt , LEEE Trans.
IW- 16, No. I (i968), pp. 137- 140; also in STL-QPSR 2-3/1967, pp. 76-81.
(2) J. Liljencrants: "Speech Synthesizer Control by Smoothed Step Functions", STL-QPSR 4/1969, pp. 43 -50.
(3) Af. Derkach: "Phoneinc Coarticulation in Russian Hard and Soft VCV- Utterances with Voiceless Fricatives", STL-QPSR 2-3/1970, pp. 1-7.