Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report
Intelligibility carried by speech source functions: Implications for theory of speech perception
Ananthapadmanabha, T. V.
journal: STL-QPSR, volume: 23, number: 4, year: 1982, pages: 049-064
http://www.speech.kth.se/qpsr
STL-QPSR 4/1982
C. INTELLIGIBILITY CARRIED BY SPEECH SOURCE FUNCTIONS: IMPLICATIONS FOR THEORY OF SPEECH PERCEPTION
Abstract
Voiced, unvoiced, plosive contrasts, intensity changes, and natural pitch variations constitute the source dynamics in speech production. It is generally understood that the phonetic message is largely conveyed by the formant pattern, whereas voice source dynamics convey mainly higher level grammatical information and personal characteristics. It is shown in this paper, however, that source dynamics alone can convey significant phonetic information. This claim is tested on carefully extracted source dynamics for a complete utterance. A simple but approximate signal processing scheme for the extraction, based on epoch filtering, is described. Intelligibility tests on speech recoded to retain only the source dynamics have been carried out. Implications of the results for speech perception theory are discussed. This type of processing may have applications in tactile communication, cochlear implantation, etc.
I. Introduction
The speech signal can be modelled as the output of a linear quasi-time-invariant filter excited by quasi-periodic pulses or random noise (Fant, 1960). The phonetic message in speech is mainly conveyed by the transfer function of the filter representing the vocal tract system. For example, silent articulation with an external throat vibrator can produce intelligible speech. Thus, research efforts in areas such as the acoustic theory of speech production, acoustic phonetics, speech perception, and speech recognition mainly emphasize the study of the vocal tract transfer function. Only a limited number of studies have been made on the information content of source dynamics.
From the studies listed in Table I, it is known that the prosodic features in speech convey information on word accent pattern, syllable or word boundaries, grammatical differences (verb/noun; interrogation/statement), emotions, and personal voice qualities. In studies of source dynamics only, or with the aid of lipreading, the presupposition is that the listeners guess a word from fractional prosodic and segmental cues, and do not explicitly hear the phonemes. Further, in these earlier studies the source dynamics as defined above are only partially retained.
Fig. 1a. Speech signal of the Swedish utterance: " lk se l l var d*".
Fig. 1b. Automatically derived inverse filter output corresponding to the speech signal in Fig. 1a.
The following aspects of intensity processing are not always appreciated. The usual technique employs rectification and lowpass filtering to about 30 Hz. This method provides a measure of the sound pressure level of acoustic signals and does not necessarily reflect the temporal contrasts of the subjective loudness level changes in speech sounds. A suitable parameter extraction scheme should reflect the physiological effort involved in the production of speech sounds. In the case of vowel sounds, for example, this effort is related to the slope of the glottal pulse at glottal closure. Inverse filtering experiments in connected speech (Fant, 1979) show that the amplitude of the glottal pulses is nearly constant, whereas the slope of the glottal pulses at closure, i.e., the overall formant amplitude scale factor, varies considerably in connected speech. The voice pitch in speech is the periodicity of a pulse train. Lowpass filtering, commonly employed in pitch extraction, averages the rapid changes in pitch period.
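The last point can be made concrete with a small numerical sketch. It is not taken from the paper: the epoch instants are hypothetical, and a three-point moving average merely stands in for whatever lowpass filter a pitch extractor might use. The sketch derives one pitch period per cycle from the excitation instants and shows how smoothing blurs an abrupt change in period.

```python
def periods_from_epochs(epoch_times_ms):
    """Pitch period of each cycle = spacing between successive excitation instants."""
    return [t1 - t0 for t0, t1 in zip(epoch_times_ms, epoch_times_ms[1:])]


def moving_average(values, width):
    """Crude lowpass: mean over a centred window, truncated at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical epochs (ms): the period jumps abruptly from 10 ms to 8 ms.
epochs_ms = [0, 10, 20, 30, 38, 46, 54]
periods = periods_from_epochs(epochs_ms)   # one value per pitch cycle
smoothed = moving_average(periods, 3)      # the jump is spread over two cycles

print(periods)
print([round(p, 2) for p in smoothed])
```

In the cycle-by-cycle sequence the change of period is located at a single cycle, whereas the smoothed sequence passes through intermediate values that were never produced by the source.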
The formulation of the problem, viz., whether the source dynamics can convey the phonetic message, may appear rather intriguing. During the author's thesis work, an analog pitch meter based on the epoch filtering technique was built. The author noticed that the epoch filter output gave an auditory impression of speech. This paper is a formal follow-up of that observation.
In Sec. II we describe a carefully conducted experiment on source dynamics based on inverse filtering. In Sec. III, a signal processing scheme for approximately coding the source dynamics is described. In Sec. IV, listening test results are presented. Some applications and a discussion then follow.
II. Inverse filtering experiments
A Swedish utterance spoken by an adult male speaker was carefully analyzed by the square-root-linear-prediction technique (Ananthapadmanabha & Chakravarthy, 1980) to automatically estimate the formant frequencies and bandwidths. The speech signal was then inverse filtered using the estimated formant data. The speech signal and the inverse filtered waveform are shown in Figs. 1a and 1b, respectively. The spectrograms of these signals are shown in Fig. 2. The inverse filtered signal clearly has a phonetic quality. The original utterance can easily be identified. This is not surprising, as the spectrograms retain the formant transitional information. It is interesting to note that the inverse filtering operation does not remove the formant structure, since (a) the formant frequency and bandwidth settings are not always correct, and (b) the source-filter interaction (Ananthapadmanabha & Fant, 1982) causes additional ripple components which cannot be removed by inverse filtering. A five-parameter model for the voice source is then matched on a period-to-period basis by an automatic analysis program developed by the author. The result of this matching procedure is shown in Fig. 3a. The spectrogram of this signal is shown in Fig. 4a. An alternative source function, with exponentially shaped glottal pulses whose initial amplitude corresponds to the main excitation strength and which are placed at the proper locations, is also generated. The corresponding signal is shown in Fig. 3b and its spectrogram in Fig. 4b. Though the signal has a buzzy quality, the phonetic message is largely conveyed. Vowels are more easily perceived.
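The principle of the experiment can be sketched for a single formant. The code below is an idealised illustration, not the analysis program used in this study: the sampling rate, formant frequency, bandwidth, epoch locations, strengths and decay factor are all assumed values. A source made of exponentially shaped pulses excites one formant resonator, and the matching anti-resonator recovers the source.

```python
import math


def formant_coeffs(formant_hz, bandwidth_hz, fs):
    """Feedback coefficients of a two-pole resonator for one formant."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2.0 * math.pi * formant_hz / fs
    return 2.0 * r * math.cos(theta), -r * r


def resonator(x, a1, a2):
    """All-pole formant filter: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn + a1 * y1 + a2 * y2
        y.append(yn)
        y1, y2 = yn, y1
    return y


def inverse_filter(y, a1, a2):
    """Anti-resonator (FIR): x[n] = y[n] - a1*y[n-1] - a2*y[n-2]."""
    x, y1, y2 = [], 0.0, 0.0
    for yn in y:
        x.append(yn - a1 * y1 - a2 * y2)
        y1, y2 = yn, y1
    return x


def exponential_source(n_samples, epochs, strengths, decay):
    """Exponentially decaying pulse starting at each epoch, scaled by its strength."""
    s = [0.0] * n_samples
    for start, amp in zip(epochs, strengths):
        for n in range(start, n_samples):
            s[n] += amp * decay ** (n - start)
    return s


fs = 8000                                        # assumed sampling rate (Hz)
a1, a2 = formant_coeffs(500.0, 80.0, fs)         # hypothetical formant and bandwidth
source = exponential_source(400, [0, 80, 165, 240], [1.0, 0.7, 0.9, 0.4], decay=0.8)
speech_like = resonator(source, a1, a2)
recovered = inverse_filter(speech_like, a1, a2)
max_error = max(abs(s - r) for s, r in zip(source, recovered))
print(max_error)   # tiny only because the exact formant settings are known here
```

The near-perfect recovery depends on the inverse filter using exactly the coefficients of the synthesis filter. With estimated formant data, as in the experiment above, residual formant structure remains in the output.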
The above experiment justifies our study of phonetic information carried by source dynamics. However, the procedure is very complex and time consuming. Hence, a simple but approximate method for extracting the source dynamics is proposed in the next section. The results obtained represent a conservative estimate. In actual practice the results may be better.
III. Signal processing scheme
In the present work we have focused on an accurate processing of voiced sounds. The consequence of the processing on unvoiced sounds is only inferred. This is to simplify the signal processing requirements.
The epoch filter accepts voiced speech as an input and provides at the output pulses whose peaks occur at the main excitation instants (the epochs) and whose amplitudes are proportional to the strengths of excitation of the voice source (Ananthapadmanabha & Yegnanarayana, 1975, 1977, 1979). In contrast to inverse filtering, which also provides accurate voice source dynamics, the epoch filter does not require an explicit knowledge of the formant frequencies and bandwidths. A simple scheme for implementing the epoch filter is shown in Fig. 5. A one-third octave bandpass filter centered at 4 kHz (B&K, type 2113) is used. The output of the bandpass filter is rectified and lowpass filtered with a third-order lowpass filter with a 3 dB cutoff frequency at 340 Hz. The output processed signal for vowel segments is illustrated in Fig. 6. The processed signal shows peaks at the main excitation instants. In this simple scheme for implementing the epoch filter, the amplitudes of pulses in the output processed signal are influenced by the transfer
Fig. 4a. Spectrogram of modelled voice source function (Fig. 3a).
Fig. 4b. Spectrogram of modelled exponential source function (Fig. 3b).
Fig. 5. Signal processing scheme used for coding source dynamics (blocks labelled in the diagram: speech signal, BPF 4 kHz, HPF 1100 Hz, AMP, LPF 340 Hz, processed signal).
Fig. 6. Vowel segment and processed signal: (a) onset, (b) steady part.
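The bandpass, rectifier and lowpass chain described in the text can be sketched digitally as follows. This is only a rough stand-in for the analog scheme: the two-pole bandpass and its 920 Hz bandwidth replace the one-third octave filter, three cascaded one-pole sections replace the third-order lowpass (so the 3 dB point is not exactly 340 Hz), the HPF and AMP blocks of Fig. 5 are omitted, and the function names and test signal are hypothetical. The input here is an idealised excitation of isolated pulses. With real voiced speech this crude bandpass would leak low-frequency formant energy, and a steeper filter would be needed.

```python
import math


def bandpass(x, fs, center_hz=4000.0, bandwidth_hz=920.0):
    """Two-pole resonator with zeros at DC and Nyquist (assumed stand-in for
    the one-third octave filter centered at 4 kHz)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / fs)
    a2 = -r * r
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for xn in x:
        yn = (xn - x2) + a1 * y1 + a2 * y2
        y.append(yn)
        x1, x2 = xn, x1
        y1, y2 = yn, y1
    return y


def one_pole_lowpass(x, fs, cutoff_hz):
    """First-order lowpass: y[n] = (1 - a)*x[n] + a*y[n-1]."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    y, prev = [], 0.0
    for xn in x:
        prev = (1.0 - a) * xn + a * prev
        y.append(prev)
    return y


def epoch_filter(x, fs):
    """Bandpass -> full-wave rectifier -> three cascaded lowpass sections."""
    env = [abs(v) for v in bandpass(x, fs)]
    for _ in range(3):
        env = one_pole_lowpass(env, fs, 340.0)
    return env


def pick_peaks(env, threshold):
    """Index of the maximum inside each contiguous run of samples above threshold."""
    peaks, run = [], []
    for n, v in enumerate(env):
        if v > threshold:
            run.append(n)
        elif run:
            peaks.append(max(run, key=lambda i: env[i]))
            run = []
    if run:
        peaks.append(max(run, key=lambda i: env[i]))
    return peaks


fs = 16000                                   # assumed sampling rate (Hz)
excitation = [0.0] * 800
for instant, strength in [(100, 1.0), (300, 0.5), (500, 0.8)]:   # hypothetical epochs
    excitation[instant] = strength
processed = epoch_filter(excitation, fs)
epochs = pick_peaks(processed, 0.2 * max(processed))
print(epochs, [round(processed[n], 4) for n in epochs])
```

Each output peak lags its excitation instant by the same filter delay, and the peak heights scale with the excitation strengths, which is the behaviour the text attributes to the epoch filter.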
about 30%. Voiced/unvoiced discrimination was made correctly in 85% of the cases. Comparing these results with the carefully inverse filtered scheme of Sec. II, it follows that the latter preserves a higher recognition score for vowels because of the presence of residual formant-like ripple in the glottal pulse shape from source-filter interaction and a higher overall S/N ratio.
The task of recognizing the numbers gave the listeners complete contextual knowledge. Top-down processing can be expected to be very active. Though the contextual information is provided, the task is still difficult, since most of the numbers have the same number of syllables and nearly the same duration. Pitch and intensity changes have to be used effectively to achieve high scores. Five lists, each containing twenty-five numbers in random order, with each number uttered twice, were used in the test. The recognition was independent of the list and depended only on the complexity of the phonetic structure. This showed that no special training was required.
To illustrate this aspect, the listeners' responses to the five numbers in list no. 1 and list no. 5 are given in Table II. The scores are computed for the correct recognition of the spoken second digit only (i.e., in "sexton" (16), "ton" is the second digit), as well as for correct recognition of both digits. The average scores over all five lists for the four listeners are 80%, 78%, 76%, and 70%, respectively, for the second digit recognition and 54%, 46%, 46%, and 35% for both digits. The recognition was found to be better for the second digit. The errors are mainly due to the low frequency representation of fricative sounds and poor vowel-nasal discrimination.
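The both-digits scoring can be sketched on the ten items of Table II. The data structure and function names are hypothetical, and the table entries are used as read from the printed table. Only whole-number matches are counted. The second-digit score depends on the spoken form of the Swedish numerals (e.g., "ton" in "sexton") and is not reproduced here.

```python
# Table II as (spoken number, responses of the four listeners).
TABLE_II = {
    "List 1": [
        (34, [34, 34, 54, 34]),
        (39, [39, 38, 39, 39]),
        (40, [17, 40, 70, 14]),
        (57, [55, 77, 76, 36]),
        (18, [18, 15, 18, 15]),
    ],
    "List 5": [
        (42, [72, 72, 72, 72]),
        (64, [64, 64, 64, 64]),
        (73, [43, 73, 73, 73]),
        (30, [30, 30, 30, 30]),
        (84, [84, 84, 84, 84]),
    ],
}


def percent_both_digits(rows, listener):
    """Percentage of items for which the listener's whole response is correct."""
    hits = sum(1 for spoken, responses in rows if responses[listener] == spoken)
    return 100.0 * hits / len(rows)


def mean(values):
    return sum(values) / len(values)


all_rows = TABLE_II["List 1"] + TABLE_II["List 5"]
scores = [percent_both_digits(all_rows, k) for k in range(4)]

# Averages over the four listeners of the scores quoted in the text.
mean_second_digit = mean([80, 78, 76, 70])
mean_both_digits = mean([54, 46, 46, 35])
print(scores, mean_second_digit, mean_both_digits)
```

The ten items shown are only a sample of the full test material, so the per-listener figures computed from them need not coincide with the averages over all five lists.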
Fig. 8. Spectrograms of the original (A) and processed (B) speech of the utterance LISm TVA.
This study was motivated by the works of Risberg (1982) and Spens (1981). In the development of hearing aids, it would be useful to know the relative contributions of spectral dynamics and source dynamics to speech intelligibility. The present study has given us an idea of the contribution of source dynamics. The intelligibility conveyed by gross spectral dynamics has been studied by Risberg & Agelfors (1982). In a particular design of a single-channel tactile device (Traunmüller, 1980), the dynamics of the center of gravity of the short-time spectrum is coded as a frequency-modulated sine wave, with the amplitude being modulated by the intensity of the speech wave. The range of the frequency modulation is kept between 30 and 300 Hz to match the frequency response of the skin. This scheme was used in the study on the intelligibility of spectral dynamics (Risberg & Agelfors, 1982). The frequency range was changed to 250-2500 Hz and the signal was presented in the auditory mode. The average intelligibility score for the number recognition task (the same as in Sec. IV) was found to be about 50%. Thus, it seems that the source dynamics and the spectral dynamics, both coded crudely, contribute equally to the intelligibility. However, since the temporal and spectral features are both useful, a two-channel device as a tactile aid can be expected to give a higher performance, especially when combined with lipreading. Since the proprioceptive transduction of the skin has not evolved to deconvolve the speech signal, the separation of the speech signal into two parallel streams conveying temporal and spectral features appears to be very appropriate.
With the processed source signal (Fig. l) acting as the source to a single resonator tuned to the center of gravity of the spectrum, a coded signal useful in cochlear implantation techniques could be obtained.
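The center-of-gravity coding described above can be sketched as follows. This is not the design of the cited device: the magnitude-weighted centroid, the linear mapping law, the frame length, the sampling rate and all function names are assumptions made for illustration. Each frame's spectral center of gravity sets the frequency of a sine carrier within a chosen range (30-300 Hz here), and the frame intensity sets its amplitude.

```python
import cmath
import math


def spectral_centroid(frame, fs):
    """Center of gravity (Hz) of the magnitude spectrum of one frame (plain DFT)."""
    n = len(frame)
    num = den = 0.0
    for k in range(1, n // 2):
        mag = abs(sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n)))
        num += (k * fs / n) * mag
        den += mag
    return num / den if den > 0 else 0.0


def map_to_range(freq_hz, in_lo, in_hi, out_lo=30.0, out_hi=300.0):
    """Linear map of the centroid onto the carrier range (mapping law assumed)."""
    t = (min(max(freq_hz, in_lo), in_hi) - in_lo) / (in_hi - in_lo)
    return out_lo + t * (out_hi - out_lo)


def rms(frame):
    """Root-mean-square level, used here as the intensity measure."""
    return math.sqrt(sum(v * v for v in frame) / len(frame))


def fm_sine(carrier_hz_per_frame, amp_per_frame, frame_len, fs):
    """Phase-continuous sine whose frequency and amplitude follow per-frame values."""
    out, phase = [], 0.0
    for f, a in zip(carrier_hz_per_frame, amp_per_frame):
        for _ in range(frame_len):
            phase += 2.0 * math.pi * f / fs
            out.append(a * math.sin(phase))
    return out


fs = 8000            # assumed sampling rate (Hz)
frame_len = 64       # assumed frame length (samples)
# Two test frames: pure tones at 500 Hz and 2000 Hz stand in for speech frames.
frames = [[math.sin(2.0 * math.pi * f * m / fs) for m in range(frame_len)]
          for f in (500.0, 2000.0)]
centroids = [spectral_centroid(fr, fs) for fr in frames]
carriers = [map_to_range(c, 0.0, fs / 2.0) for c in centroids]
levels = [rms(fr) for fr in frames]
coded = fm_sine(carriers, levels, frame_len, fs)
print([round(c, 2) for c in centroids], [round(c, 2) for c in carriers])
```

Passing `out_lo=250.0` and `out_hi=2500.0` to `map_to_range` gives the range used for the auditory presentation mentioned in the text.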
V. Conclusion
Listeners may perceive the phonetic content of a message from many different types of artificial signals. For example, it is claimed that listeners can perceive the phoneme sequence from "sine wave speech" (Remez & al, 1981; Risberg & Agelfors, 1982). But the listeners' initial reaction to "sine wave speech" is not one of phoneme sequence. After considerable training, listeners may start perceiving phoneme sequences over parts of "sine wave speech". But in the case of our recoded signal, which retains the source dynamics only, listeners, without any training, readily perceive "voice" and attempt to ascribe a phonetic significance. This observation has also been reported by Douek & al (1977). It appears that a link has to be established to the "language hemisphere", and once such a link is present, i.e., the listener is in the "speech mode", the phonemes are easily received. In the case of "sine wave speech", the speech mode is not as easily attained as with the source dynamics coding.
The source dynamics are potentially capable of conveying phonetic cues such as (a) voiced/unvoiced/plosive contrasts, (b) sharp changes in pitch across phoneme boundaries (Chistovich, 1968), (c) the roll-off of the source pulses, which tends to be 18 dB per octave for voiced consonants (but not nasals) as opposed to 12 dB per octave for vowels, and (d) voice onset timings, which can code the voicing information for stops. In fact, there is so much information in the source dynamics that for connected speech, formant information may in part be considered redundant. Also see Lieberman (1967). It is interesting to note that listeners confuse fricatives and stops (/s/ is perceived as /k/) for the type of processing described in Sec. II. Another example is that /l/ is confused with /n/. This type of confusion is similar to the errors made by children during language acquisition.
The key to the evolution of language and language acquisition may lie in the source dynamics.
Acknowledgments
The author wishes to thank Dr. Arne Risberg for providing the speech material for the number recognition tasks and for bringing to my notice several references. The author wishes to thank Prof. Gunnar Fant for his consent to work on this ancillary problem and for his very valuable and critical comments. Thanks are also due to my colleagues for their cooperation.
References
Ananthapadmanabha, T.V. & Fant, G. (1982): "Calculation of glottal flow and its components", Speech Communication 1 (1982), pp. 167-184.
Ananthapadmanabha, T.V. & Chakravarthy, H.S. (1980): "Linear prediction analysis based on square root autocorrelation", Electro Tech., Sept. 1980; also in Proc. Int. Conf. in Speech Processing (ed. P.V.S. Rao), TIFR, Bombay.
Ananthapadmanabha, T.V. & Yegnanarayana, B. (1975): "Epoch extraction of voiced speech", IEEE Trans. ASSP-23, pp. 562-570.
Ananthapadmanabha, T.V. & Yegnanarayana, B. (1977): "Zero phase inverse filtering for extraction of source characteristics", ICASSP, pp. 336-339.
Ananthapadmanabha, T.V. & Yegnanarayana, B. (1979): "Epoch extraction from linear prediction residual for identification of closed glottis interval", IEEE Trans. ASSP-27, pp. 309-313.
Chistovich, L.A. (1968): "Direction of transition as a perceptual
TABLE I. The role of source dynamics in speech perception: summary of some studies

Fry (1963)
Hadding-Koch & Studdert-Kennedy (1964)
Kozhevnikov & Chistovich (1965) (see: Pickett, 1980)
Svensson (1971)
Lehiste & Wang (1976)
Douek & al (1977)
Risberg & Agelfors (1978)
Risberg & Lubker (1978)

Noun/Verb distinction
Question/Statement
Word boundaries, word stress
Accent pattern
Sentence endings
Cochlear implant: stress patterns, voice, word recognition
Hearing aid: word
Speech understanding

Speech signal: duration and formants
Speech signal
Bandpass speech: 906-1141 Hz
Hummed speech
Inverted speech
Lipreading + Pitch (Laryngograph)
Lipreading, intensity and/or pitch (lowpass filtering)

TABLE II. Listeners' responses for number recognition task

          SPOKEN   RESPONSES
List 1:   34       34  34  54  34
          39       39  38  39  39
          40       17  40  70  14
          57       55  77  76  36
          18       18  15  18  15
List 5:   42       72  72  72  72
          64       64  64  64  64
          73       43  73  73  73
          30       30  30  30  30
          84       84  84  84  84