
Dept. for Speech, Music and Hearing

Quarterly Progress and Status Report

Intelligibility carried by speech source functions: Implications for theory of speech perception

Ananthapadmanabha, T. V.

journal: STL-QPSR, volume: 23, number: 4, year: 1982, pages: 049-064

http://www.speech.kth.se/qpsr

http://www.speech.kth.se


STL-QPSR 4/1982

C. INTELLIGIBILITY CARRIED BY SPEECH SOURCE FUNCTIONS: IMPLICATIONS FOR THEORY OF SPEECH PERCEPTION

Abstract

Voiced, unvoiced, plosive contrasts, intensity changes, and natural pitch variations constitute the source dynamics in speech production. It is generally understood that the phonetic message is largely conveyed by the formant pattern whereas voice source dynamics convey mainly higher level grammatical information and personal characteristics. It is shown in this paper, however, that source dynamics alone can convey significant phonetic information. This claim is tested on carefully extracted source dynamics for a complete utterance. A simple but approximate signal processing scheme for the extraction based on epoch filtering is described. Intelligibility tests on speech recoded to retain only the source dynamics have been carried out. Implications of the results for speech perception theory are discussed. This type of processing may have applications in tactile communication, cochlear implantation etc.

I. Introduction

The speech signal can be modelled as the output of a linear quasi-time-invariant filter excited by quasi-periodic pulses or random noise (Fant, 1960). The phonetic message in speech is mainly conveyed by the transfer function of the filter representing the vocal tract system. For example, silent articulation with an external throat vibrator can produce intelligible speech. Thus, research efforts in areas such as the acoustic theory of speech production, acoustic-phonetics, speech perception, speech recognition etc. mainly emphasize the study of the vocal tract transfer function. Only a limited number of studies have been made on the information content of source dynamics.

From the studies listed in Table I, it is known that the prosodic features in speech convey information on word accent pattern, syllable or word boundaries, grammatical differences (verb/noun; interrogation/statement), emotions, and personal voice qualities. In studies of source dynamics only or with the aid of lipreading, the presupposition is that the listeners guess a word from fractional prosodic and segmental cues, and do not explicitly hear the phonemes. Further, in these earlier studies the source dynamics as defined above are only partially retained.

Fig. 1a. Speech signal of the Swedish utterance: "lk se l l var d*".

Fig. 1b. Automatically derived inverse filter output corresponding to the speech signal in Fig. 1a.

The following aspects of intensity processing are not always appreciated. The usual technique employs rectification and lowpass filtering to about 30 Hz. This method provides a measure of the sound pressure level of acoustic signals and does not necessarily reflect the temporal contrasts of the subjective loudness level changes in speech sounds. A suitable parameter extraction scheme should reflect the physiological effort involved in the production of speech sounds. In the case of vowel sounds, for example, this effort is related to the slope of the glottal pulse at glottal closure. Inverse filtering experiments in connected speech (Fant, 1979) show that the amplitude of the glottal pulses is nearly constant whereas the slope of the glottal pulses at closure, i.e., the overall formant amplitude scale factor, varies considerably in connected speech. The voice pitch in speech is the periodicity of a pulse train. Lowpass filtering, commonly employed in pitch extraction, averages the rapid changes in pitch period.
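The effect of smoothing on the pitch contour can be illustrated with a small sketch. The epoch instants, the 10 ms and 8 ms periods and the five-point moving average below are hypothetical choices made for illustration; none of them is taken from the experiments reported here. The sketch compares period-by-period pitch, computed from successive excitation instants, with a smoothed version of the same track.

```python
def period_by_period_f0(epochs_s):
    """F0 in Hz from successive epoch instants (seconds), one value per period."""
    return [1.0 / (t1 - t0) for t0, t1 in zip(epochs_s, epochs_s[1:])]

def moving_average(values, width):
    """Crude stand-in (assumption) for the lowpass smoothing in a pitch extractor."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

# Hypothetical epochs: six periods of 10 ms followed by six periods of 8 ms.
epochs = [0.0]
for period in [0.010] * 6 + [0.008] * 6:
    epochs.append(epochs[-1] + period)

f0 = period_by_period_f0(epochs)          # abrupt change from about 100 Hz to 125 Hz
f0_smooth = moving_average(f0, width=5)   # the same change spread over several periods
```

In the period-by-period track the change happens between two adjacent periods, whereas the smoothed track spreads it over roughly five periods, which is the kind of averaging the paragraph above refers to.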

The formulation of the problem, viz., whether the source dynamics can convey the phonetic message, may appear rather intriguing. During the author's thesis work, an analog pitch meter based on the epoch filtering technique was built. The author noticed that the epoch filter output had an auditory impression of speech. This paper is a formal follow-up of that observation.

In Sec. II we describe a carefully conducted experiment on source dynamics based on inverse filtering. In Sec. III, a signal processing scheme for approximately coding the source dynamics is described. In Sec. IV, listening test results are presented. Some applications and discussion then follow.

II. Inverse filtering experiments

A Swedish utterance spoken by an adult male speaker was carefully analyzed by the square-root-linear-prediction technique (Ananthapadmanabha & Chakravarthy, 1980) to automatically estimate the formant frequencies and bandwidths. The speech signal is then inverse filtered using the estimated formant data. The speech signal and the inverse filtered waveform are shown in Figs. 1a and 1b, respectively. The spectrograms of these signals are shown in Fig. 2. The inverse filtered signal clearly has a phonetic quality. The original utterance can easily be identified. This is not surprising as the spectrograms retain the formant transitional information. It is interesting to note that the inverse filtering operation does not remove the formant structure since (a) the formant frequency and bandwidth settings are not always correct, and (b) the source-filter interaction (Ananthapadmanabha & Fant, 1982) causes additional ripple components which cannot be removed by inverse filtering. A five parameter model for the voice source is then matched on a period-to-period basis by an automatic analysis program developed by the author. The result of this matching procedure is shown in Fig. 3a. The spectrogram of this signal is shown in Fig. 4a. An alternative source function with exponentially shaped glottal pulses and initial amplitude corresponding to the main excitation strength and with proper locations is also generated. The corresponding speech signal is shown in Fig. 3b and Fig. 4b. Though the signal has a buzzy quality the phonetic message is largely conveyed. Vowels are more easily perceived.
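Point (a) above can be illustrated with a minimal sketch, assuming a single formant modelled as a two-pole resonator and an idealized impulse-train source. The sampling rate, pitch period, formant values and function names are hypothetical, and the sketch is not the square-root-linear-prediction analysis used by the author. It shows that a correctly tuned inverse filter returns the excitation, while a mistuned one leaves formant-like ripple in the residual.

```python
import math

def formant_coeffs(freq_hz, bw_hz, fs):
    """Denominator coefficients (a1, a2) of a two-pole resonator."""
    r = math.exp(-math.pi * bw_hz / fs)
    return -2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs), r * r

def resonate(x, a1, a2):
    """All-pole filter: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn - a1 * y1 - a2 * y2
        y.append(yn)
        y2, y1 = y1, yn
    return y

def inverse_filter(y, a1, a2):
    """All-zero filter that cancels the resonance: e[n] = y[n] + a1*y[n-1] + a2*y[n-2]."""
    e = []
    for n in range(len(y)):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        e.append(y[n] + a1 * y1 + a2 * y2)
    return e

def ripple(e, period):
    """Largest residual magnitude away from the excitation instants."""
    return max(abs(v) for n, v in enumerate(e) if n % period != 0)

fs = 10000       # hypothetical sampling rate in Hz
period = 100     # hypothetical pitch period in samples (100 Hz)
source = [1.0 if n % period == 0 else 0.0 for n in range(3 * period)]
speech = resonate(source, *formant_coeffs(500.0, 80.0, fs))

exact = inverse_filter(speech, *formant_coeffs(500.0, 80.0, fs))    # correct formant setting
detuned = inverse_filter(speech, *formant_coeffs(550.0, 80.0, fs))  # 50 Hz formant error
```

Here `ripple(exact, period)` is at rounding-error level, while `ripple(detuned, period)` is clearly non-zero. Source-filter interaction, point (b), is not modelled in this sketch.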

The above experiment justifies our study of the phonetic information carried by source dynamics. However, the procedure is very complex and time consuming. Hence, a simple but approximate method for extracting the source dynamics is proposed in the next section. The results obtained represent a conservative estimate. In actual practice the results may be better.

III. Signal processing scheme

In the present work we have focused on an accurate processing of voiced sounds. The consequence of the processing on unvoiced sounds is only inferred. This is to simplify the signal processing requirements.

The epoch filter accepts voiced speech as an input and provides at the output pulses, whose peaks occur at the main excitation instants (the epochs) and whose amplitudes are proportional to the strengths of excitation of the voice source (Ananthapadmanabha & Yegnanarayana, 1975, 1977, 1979). In contrast to inverse filtering, which also provides accurate voice source dynamics, the epoch filter does not require an explicit knowledge of the formant frequencies and bandwidths. A simple scheme for implementing the epoch filter is shown in Fig. 5. A one-third octave bandpass filter, centered at 4 kHz, B&K type 2113, is used. The output of the bandpass filter is rectified and lowpass filtered with a third order lowpass filter with a 3 dB cutoff frequency at 340 Hz. The output processed signal for vowel segments is illustrated in Fig. 6. The processed signal shows peaks at the main excitation instants. In this simple scheme for implementing the epoch filter, the amplitudes of pulses in the output processed signal are influenced by the transfer

Fig. 4a. Spectrogram of modelled voice source function (Fig. 3a).

Fig. 4b. Spectrogram of modelled exponential source function (Fig. 3b).

Fig. 5. Signal processing scheme used for coding source dynamics. Block labels: speech signal, BPF 4 kHz, HPF 1100 Hz, AMP, LPF 340 Hz, processed signal.

Fig. 6. Vowel segment and processed signal: (a) onset, (b) steady part.
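The bandpass, rectify and lowpass chain described in Sec. III can be sketched digitally as follows. This is a rough illustration under stated assumptions, not the author's analog implementation: the B&K one-third octave filter is replaced by a two-pole resonator with a bandwidth of about 926 Hz (the one-third octave width at 4 kHz), the third order lowpass is approximated by three cascaded one-pole sections, the 16 kHz sampling rate and the synthetic single-formant test vowel are hypothetical, and the HPF 1100 Hz block that appears in Fig. 5 is omitted.

```python
import math

def bandpass(x, fc, bw, fs):
    """Two-pole, two-zero resonator; a stand-in (assumption) for the one-third octave BPF."""
    r = math.exp(-math.pi * bw / fs)
    b1 = -2.0 * r * math.cos(2.0 * math.pi * fc / fs)
    b2 = r * r
    g = (1.0 - r * r) / 2.0          # roughly unity gain at fc
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = g * (xn - x2) - b1 * y1 - b2 * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y

def lowpass3(x, fc, fs):
    """Three cascaded one-pole sections with an overall gain of about -3 dB near fc."""
    fc_section = fc / math.sqrt(2.0 ** (1.0 / 3.0) - 1.0)
    a = math.exp(-2.0 * math.pi * fc_section / fs)
    y = list(x)
    for _ in range(3):
        state = 0.0
        for n, v in enumerate(y):
            state = (1.0 - a) * v + a * state
            y[n] = state
    return y

def epoch_filter(speech, fs):
    """Bandpass at 4 kHz, full-wave rectification, lowpass at 340 Hz."""
    band = bandpass(speech, fc=4000.0, bw=926.0, fs=fs)
    rectified = [abs(v) for v in band]
    return lowpass3(rectified, fc=340.0, fs=fs)

def synthetic_vowel(strengths, period, fs, formant_hz=700.0, bw_hz=100.0):
    """Hypothetical test input: pulses of given strengths through one formant resonator."""
    r = math.exp(-math.pi * bw_hz / fs)
    a1, a2 = -2.0 * r * math.cos(2.0 * math.pi * formant_hz / fs), r * r
    out, y1, y2 = [], 0.0, 0.0
    for n in range(period * len(strengths)):
        e = strengths[n // period] if n % period == 0 else 0.0
        yn = e - a1 * y1 - a2 * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out

fs = 16000                        # hypothetical sampling rate in Hz
period = 128                      # 8 ms pitch period
strengths = [1.0, 0.5, 1.0, 0.5]  # alternating excitation strengths
processed = epoch_filter(synthetic_vowel(strengths, period, fs), fs)
peaks = [max(range(k * period, (k + 1) * period), key=lambda n: processed[n])
         for k in range(len(strengths))]
```

In this sketch each entry of `peaks` falls shortly after the corresponding excitation instant, and the stronger pulses give the larger output peaks. The delay between an epoch and its output peak comes from the filters and depends on the stand-in designs chosen here.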


... about 30%. Voiced/unvoiced discrimination was made correctly in 85% of the cases. Comparing these results with the carefully inverse filtered scheme of Sec. II, it follows that the latter preserves a higher recognition score for vowels because of the presence of residual formant-like ripple in the glottal pulse shape from source-filter interaction and a higher overall S/N ratio.

The task of recognizing the numbers gave the listeners complete contextual knowledge. Top-down processing can be expected to be very active. Though the contextual information is provided, the task is still difficult since most of the numbers have the same number of syllables and nearly the same duration. Pitch and intensity changes have to be effectively used to achieve high scores. Five lists, each of them containing twenty-five numbers in random order, each number uttered twice, were used in the test. The recognition was independent of the list and depended only on the complexity of the phonetic structure. This showed that no special training was required.

To illustrate this aspect the listeners' responses to the five numbers in list no. 1 and list no. 5 are given in Table II. The scores are computed for the correct recognition of the spoken second digits only, i.e., in "sexton" (16), "ton" is the second digit, as well as for correct recognition of both the digits. The average scores for all the five lists for the four listeners are 80%, 78%, 76%, and 70%, respectively, for the second digit recognition and 54%, 46%, 46%, and 35% for both the digits. The recognition was found to be better for the second digit. The errors are mainly due to the low frequency representation of fricative sounds and poor vowel-nasal discrimination.
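As a worked illustration of the "both digits" score, the following sketch counts exact matches in the ten rows of Table II. It assumes that the four response columns of Table II correspond to the same four listeners in a fixed order, which the table does not state explicitly. The second-digit score is not attempted, since it depends on how each Swedish number word is split into spoken parts. The figures it yields cover only these ten sample items, not the five complete lists behind the averages quoted above.

```python
# Rows of Table II: spoken number followed by the four listeners' responses.
table_ii = {
    "List 1": [(34, [34, 34, 54, 34]), (39, [39, 38, 39, 39]),
               (40, [17, 40, 70, 14]), (57, [55, 77, 76, 36]),
               (18, [18, 15, 18, 15])],
    "List 5": [(42, [72, 72, 72, 72]), (64, [64, 64, 64, 64]),
               (73, [43, 73, 73, 73]), (30, [30, 30, 30, 30]),
               (84, [84, 84, 84, 84])],
}

def both_digits_score(rows):
    """Percentage of items each listener reproduced exactly (both digits correct)."""
    n_listeners = len(rows[0][1])
    correct = [0] * n_listeners
    for spoken, responses in rows:
        for i, response in enumerate(responses):
            if response == spoken:
                correct[i] += 1
    return [100.0 * c / len(rows) for c in correct]

scores = {name: both_digits_score(rows) for name, rows in table_ii.items()}
```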

Fig. 8. Spectrograms of (A) the original and (B) the processed speech of the utterance "LISm TVA".

This study was motivated by the works of Risberg (1982) and Spens (1981). In the development of hearing aids, it would be useful to know the relative contributions of spectral dynamics and source dynamics to speech intelligibility. The present study has given us an idea about the contribution of source dynamics. The intelligibility conveyed by gross spectral dynamics has been studied by Risberg & Agelfors (1982). In a particular design of a single channel tactile device (Traunmüller, 1980), the dynamics of the center of gravity of the short-time spectrum is coded as a frequency-modulated sine wave with the amplitude being modulated by the intensity of the speech wave. The range of the frequency modulation is kept between 30-300 Hz to match the frequency response of the skin. This scheme was used in the study on the intelligibility of spectral dynamics (Risberg & Agelfors, 1982). The frequency range was changed to 250-2500 Hz and the signal was presented to the auditory mode. The average intelligibility score for the number recognition task (the same as in Sec. IV) was found to be about 50%. Thus, it seems that the source dynamics and the spectral dynamics, both coded crudely, contribute equally to the intelligibility. However, since the temporal and spectral features are both useful, a two-channel device as a tactile aid can be expected to give a higher performance, especially combined with lip reading. Since the proprioceptive transduction of the skin has not evolved to deconvolve the speech signal, the separation of the speech signal into two parallel streams conveying temporal and spectral features appears to be very appropriate.

With the processed source signal (Fig. 1) acting as the source to a single resonator tuned to the center of gravity of the spectrum, a coded signal useful in cochlear implantation techniques could be obtained.
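The single-channel coding described above, in which the spectral center of gravity sets the frequency of a sine wave and the speech intensity sets its amplitude, can be sketched as follows. This is only an illustration: the linear mapping, the assumed input range of 300-3000 Hz for the center of gravity, the frame length and the function name `fm_code` are hypothetical and are not taken from the cited device.

```python
import math

def fm_code(centroid_hz, intensity, fs, frame_len,
            f_lo=30.0, f_hi=300.0, c_lo=300.0, c_hi=3000.0):
    """Code per-frame centroid and intensity tracks as an FM, AM sine wave.

    c_lo and c_hi (assumed centroid range) and the linear mapping are
    illustrative assumptions; f_lo and f_hi follow the 30-300 Hz range in the text.
    """
    out = []
    phase = 0.0
    for c, amp in zip(centroid_hz, intensity):
        c = min(max(c, c_lo), c_hi)                      # clip to the assumed range
        f = f_lo + (f_hi - f_lo) * (c - c_lo) / (c_hi - c_lo)
        for _ in range(frame_len):
            out.append(amp * math.sin(phase))
            phase += 2.0 * math.pi * f / fs              # continuous phase across frames
    return out

# Hypothetical three-frame example: rising centroid, varying intensity.
coded = fm_code([500.0, 1500.0, 2500.0], [0.2, 1.0, 0.6], fs=8000, frame_len=160)
```

Calling the function with `f_lo=250.0` and `f_hi=2500.0` would correspond to the auditory presentation mentioned above, again under the same assumed mapping.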

V. Conclusion

Listeners may perceive the phonetic content of a message from many different types of artificial signals. For example, it is claimed that listeners can perceive phoneme sequences from "sine wave speech" (Remez & al, 1981; Risberg & Agelfors, 1982). But the listeners' initial reaction to "sine wave speech" is not one of phoneme sequence. After considerable training listeners may start perceiving phoneme sequences over parts of "sine wave speech". But in the case of our recoded signal, which retains the source dynamics only, listeners, without any training, readily perceive "voice" and attempt to ascribe a phonetic significance. This observation has also been reported by Douek & al (1977). It appears that a link has to be established to the "language hemisphere" and once such a link is present, i.e., the listener is in the "speech mode", the phonemes are easily received. In the case of "sine wave speech", the speech mode is not as easily attained as with the source dynamics coding.

The source dynamics are potentially capable of conveying phonetic cues such as (a) voiced/unvoiced/plosive contrasts, (b) sharp changes in pitch across phoneme boundaries (Chistovich, 1968), (c) voiced consonants (but not nasals) tend to have source pulses with 18 dB per octave roll-off, as opposed to 12 dB per octave for vowels, and (d) voice onset timings can code the voicing information for stops. In fact, there is so much information in the source dynamics that for connected speech, formant information may in part be considered redundant. Also see Lieberman (1967). It is interesting to note that listeners confuse fricatives and stops (/s/ is perceived as /k/) for the type of processing described in Sec. II. Another example is that /l/ is confused with /n/. This type of confusion is similar to the errors made by children during language acquisition.

The key to the evolution of language and language acquisition may lie in the source dynamics.

Acknowledgments

The author wishes to thank Dr. Arne Risberg for providing the speech material for the number recognition tasks and for bringing to my notice several references. The author wishes to thank Prof. Gunnar Fant for his consent to work on this ancillary problem and for his very valuable and critical comments. Thanks are also due to my colleagues for their cooperation.

References

Ananthapadmanabha, T.V. & Fant, G. (1982): "Calculation of glottal flow and its components", Speech Communication 1 (1982), pp. 167-184.

Ananthapadmanabha, T.V. & Chakravarthy, H.S. (1980): "Linear prediction analysis based on square root autocorrelation", Electro Tech., Sept. 1980; also in Proc. Int. Conf. in Speech Processing (ed. P.V.S. Rao), TIFR, Bombay.

Ananthapadmanabha, T.V. & Yegnanarayana, B. (1975): "Epoch extraction of voiced speech", IEEE Trans. ASSP-23, pp. 562-570.

Ananthapadmanabha, T.V. & Yegnanarayana, B. (1977): "Zero phase inverse filtering for extraction of source characteristics", ICASSP, pp. 336-339.

Ananthapadmanabha, T.V. & Yegnanarayana, B. (1979): "Epoch extraction from linear prediction residual for identification of closed glottis interval", IEEE Trans. ASSP-27, pp. 309-313.

Chistovich, L.A. (1968): "Direction of transition as a perceptual

TABLE I. The role of source dynamics in speech perception: summary of some studies

Fry (1963)
Hadding-Koch & Studdert-Kennedy (1964)
Kozhevnikov & Chistovich (1965) (see: Pickett, 1980)
Svensson (1971)
Lehiste & Wang (1976)
Douek & al (1977)
Risberg & Agelfors (1978)
Risberg & Lubker (1978)

Noun/Verb distinction
Question/Statement
Word boundaries, word stress
Accent pattern
Sentence endings
Cochlear implant: stress patterns, voice, word recognition
Hearing aid: word
Speech understanding

Speech signal: duration and formants
Speech signal
Bandpass speech: 906-1141 Hz
Hummed speech
Inverted speech
Lipreading + Pitch (Laryngograph)
Lipreading, intensity and/or pitch (lowpass filtering)

TABLE II. Listeners' responses for number recognition task

         SPOKEN   RESPONSES
List 1:  34       34  34  54  34
         39       39  38  39  39
         40       17  40  70  14
         57       55  77  76  36
         18       18  15  18  15
List 5:  42       72  72  72  72
         64       64  64  64  64
         73       43  73  73  73
         30       30  30  30  30
         84       84  84  84  84
