Use of nonsense-syllable mimicry in the study of prosodic phenomena a)

Mark Y. Liberman and Lynn A. Streeter

Bell Laboratories, Murray Hill, New Jersey 07974 (Received 23 May 1977)

The technique of nonsense-syllable mimicry of natural utterances has many advantages in the study of prosodic phenomena, especially duration. In analytic studies, the elimination of segmental effects as a factor makes data collection much more efficient, and requires only one segmentation criterion. In perceptual studies, the technique eliminates lexical information without unnatural distortions of the signal. In a series of validation experiments, we have found that (1) the patterns of duration obtained by using this technique were stable and reproducible within and across speakers; and (2) mimicry of different natural models with identical stress patterns and constituent structures produced nearly indistinguishable nonsense-syllable duration patterns.

PACS numbers: 43.70.Gr, 43.70.Ve, 43.70.Qa

INTRODUCTION

In studying the timing of segments and syllables in speech, and the relationship of such durational patterns to linguistic descriptions, a number of problems arise. First, duration is influenced simultaneously by many factors, including the intrinsic duration of segments, the effects of segmental context, the stress pattern of the utterance, and its constituent structure. (See Klatt, 1976, for a recent review of the duration literature.) There results a multiplication of possibilities which makes systematic study of any one factor, while controlling for all others, extremely difficult. Second, it is difficult to specify nonarbitrary segmentation criteria in general, and nearly impossible to segment at all in some cases; we don't really know what the true dimensions of speech timing are, either in production or perception.

In an attempt to circumvent these difficulties, some phoneticians have used nonsense-syllable strings rather than normal utterances. Oller (1973) and Lehiste (1975) have used nonsense-syllable strings to study stress effects in English. Also, Lindblom (1968) as well as Erickson and Rapp (1973) have used nonsense syllables to study durational patterns in Swedish. Erickson and Rapp used a technique in which a natural utterance, such as "this is an utterance," is imitated by substituting some nonsense syllable, such as [ma], for each syllable in the original utterance, i.e., "mama ma mamama."

This technique of imitation by constant syllable substitution has a great deal of potential value in the study of prosodic phenomena in general, since it appears to eliminate inherent segmental variability entirely, leaving the durational pattern subject only to the influence of factors such as stress and constituent structure. Furthermore, the criteria of segmentation need be specified only once, so that even if these criteria are psychologically arbitrary, one can hope that the effect on the overall pattern will be minimal. An additional practical benefit is that the segmentation process can be automated without great difficulty, removing both experimenter bias and experimenter boredom as sources of error, and expediting the data collection process.

a) Based on a paper presented at the 92nd Meeting of the Acoustical Society of America, San Diego, November 1976 [J. Acoust. Soc. Am. 62, S27(A) (1976)].

We have been using this technique of mimicry by constant syllable substitution (for which Nakatani has coined the name "reiterant speech") in modeling prosodic influences on duration. Although reiterant speech has great potential in this area, as we have said, there are a number of quite legitimate questions that can be asked about the assumptions which underlie its use. We wish to provide some evidence in answer to two of these questions: (1) Can speakers really achieve accurate control over reiterant speech productions, i.e., are the results reliable? and (2) Do the same stress pattern and constituent structure produce the same "mama" durational pattern, regardless of the segmental makeup of the target utterance?

I. METHOD

In our studies, the speaker being recorded read a target sentence from a card, using a normal speaking rate and a declarative intonation pattern, and then, after a suitable pause, imitated the target sentence by substituting a [ma] for each syllable in the sentence while attempting to preserve the rhythm and intonation of the target sentence. After all the utterances in the experimental set had been produced, the cards were shuffled and the process repeated (a total of ten times) to obtain the ten tokens of each target sentence and each "mama" imitation to be averaged. In the data to be reported the speakers were the two authors.
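The recording procedure above can be pictured as a simple randomized schedule. The following is only an illustrative sketch: the function name `recording_schedule`, the `seed` parameter, and the decision to shuffle every pass (including the first) are assumptions, not details given in the text.

```python
import random

def recording_schedule(sentences, passes=10, seed=None):
    """Return one shuffled presentation order per pass through the card deck."""
    rng = random.Random(seed)          # seed is only for reproducible illustration
    schedule = []
    for _ in range(passes):
        deck = list(sentences)         # fresh copy of the full card deck
        rng.shuffle(deck)              # shuffle the cards before this pass
        schedule.append(deck)
    return schedule

# Two of the target sentences quoted in the paper stand in for the full set.
cards = [
    "This bit may cut our pretty colt.",
    "This cold may freeze our grungy seeds.",
]
schedule = recording_schedule(cards, passes=10, seed=1)
print(len(schedule))  # 10
```

Each pass contains every card exactly once, so ten passes yield ten tokens of each target sentence and of each imitation.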

J. Acoust. Soc. Am. 63(1), Jan. 1978 0001-4966/78/6301-0231$00.80 © 1978 Acoustical Society of America

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 128.240.225.44 On: Sat, 20 Dec 2014 04:01:11

[Figure 1: bar charts of the ten largest differences in mean syllable duration (msec). Left panels: original syllable/syllable comparison; right panels: corresponding "ma"/"ma" comparison.]

(a) Speaker 1. Originals: mean difference = 154 msec; differences (msec): 368, 273, 177, 177, 153, 132, 72, 70, 70, 47. "Mama" imitations: mean difference = 6 msec.

(b) Speaker 2. Originals: mean difference = 123 msec; differences (msec): 324, 269, 153, 92, 88, 83, 69, 68, 51, 37. "Mama" imitations: mean difference = 17 msec; differences (msec): 53, 39, 39, -3, 5, -2, 32, 13, 3, -5.

FIG. 1. Mean durations of syllables in original utterances and corresponding "mama" imitations for (a) Speaker 1 and (b) Speaker 2. Small solid bars represent plus and minus one standard error of the mean.

Durations in the target utterances were measured using a computer waveform editor (Nakatani, 1977). The target utterances were constructed so as to maximize ease of segmentation with the waveform editor. Word boundaries and syllable boundaries were marked by either stop consonants, nasals, or fricatives. Thus each "syllable boundary" was identified acoustically as the transition between open (vowels) and closed (stop, nasal, fricative) states of the vocal tract, or by the boundary between two acoustically distinct kinds of closure (e.g., stop/nasal). Durations of the "mama" versions were measured automatically by a computer pattern-recognition technique, based on the voiced-unvoiced-silence decision algorithm developed by Atal and Rabiner (1976), and modified to decide between the three categories [m], [a], and silence.
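To make the automatic measurement of the "mama" versions concrete, the sketch below turns a sequence of per-frame labels into syllable durations. It does not reproduce the Atal and Rabiner decision algorithm; it assumes the three-way labels ("m", "a", "sil") are already available. The 10-ms frame step, the function names, and the convention that a syllable runs from the onset of [m] to the end of the following [a] are all assumptions made for illustration.

```python
from itertools import groupby

FRAME_MS = 10  # assumed frame step; the paper does not state one

def label_runs(labels):
    """Collapse per-frame labels into (label, n_frames) runs."""
    return [(lab, sum(1 for _ in grp)) for lab, grp in groupby(labels)]

def ma_syllable_durations(labels, frame_ms=FRAME_MS):
    """Duration in ms of each [m]+[a] syllable; silence runs are skipped."""
    runs = label_runs(labels)
    durations = []
    i = 0
    while i < len(runs):
        lab, n = runs[i]
        # A syllable is an [m] run immediately followed by an [a] run.
        if lab == "m" and i + 1 < len(runs) and runs[i + 1][0] == "a":
            durations.append((n + runs[i + 1][1]) * frame_ms)
            i += 2
        else:
            i += 1
    return durations

# Hypothetical label sequence: silence, two "ma" syllables, silence.
frames = ["sil"] * 3 + ["m"] * 6 + ["a"] * 12 + ["m"] * 5 + ["a"] * 20 + ["sil"] * 4
print(ma_syllable_durations(frames))  # [180, 250]
```

Because every syllable has the same segmental makeup, a single boundary rule of this kind suffices for the whole utterance.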

II. RESULTS

The first point that must be made to establish the reliability of the reiterant-speech technique as a method for exploring durational patterns is that resultant "mama" durations are stable and reproducible across repeated mimicry of the same target utterance. For one subject, we have found that the average standard error for ten repetitions, by syllable position, was less than 5 ms, across three studies including 19 different utterances. For a second speaker, the average standard error for a similar set of data was about 8 ms. Thus, the technique is quite sensitive, since differences between mean durations of as little as 10 or 20 ms are often statistically significant.
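The reliability measure described here, the standard error of the mean by syllable position averaged over positions, can be sketched as follows. The durations are hypothetical placeholders, not data from the study, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))

def average_standard_error(tokens):
    """tokens: one list of syllable durations (ms) per repetition."""
    by_position = zip(*tokens)  # regroup the durations by syllable position
    return mean(standard_error(position) for position in by_position)

# Hypothetical "mama" durations (ms): four repetitions of a three-syllable target.
tokens = [
    [182, 240, 151],
    [190, 236, 149],
    [178, 251, 160],
    [186, 244, 155],
]
print(round(average_standard_error(tokens), 1))
```

As a rough guide, if two means each have a standard error near 5 ms and are treated as independent, the standard error of their difference is about sqrt(5² + 5²) ≈ 7 ms, so a 20-ms difference is nearly three standard errors.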

The question immediately arises: What effects are being measured with such sensitivity? We find that the segmental makeup of the target utterance has little or no effect on the durations of the corresponding "ma" syllables, which (we hypothesize) are primarily determined by the stress pattern and constituent structure of the target utterance.

To test this point, we made up four pairs of target sentences, each pair having identical stress and constituent structure, but very different syllabic durations due to segmental effects. An example of one such pair is:

"This bit may cut our pretty colt."

"This cold may freeze our grungy seeds."

Ten tokens of each of these eight utterances were recorded, along with eight utterances selected as foils, by each of the two speakers. The results for Speaker 1 are shown in Fig. 1(a). The left panel of Fig. 1(a) shows those target syllable pairs which showed the greatest differences in mean duration, while the right panel shows the corresponding "mama" pairs. For Speaker 1, segmental differences in the target utterance were uncorrelated with the corresponding "mama" differences (r = -0.14). Figure 1(b) shows similar data for Speaker 2. For Speaker 2 there is a significant correlation between the magnitude of the differences between target syllables and the corresponding "mama" duration differences (r = 0.82, p < 0.01). However, for both speakers differences in the original target pairs were substantially reduced in the "mama" imitations. For Speaker 1 the mean difference in the original target pairs was 154 ms, while the average difference in the corresponding "mama" pairs was 6 ms. For Speaker 2 the two respective mean differences were 123 and 17 ms. It should be noted that these data reflect only the ten largest differences in the original target syllable pairs. Indeed, all four sentence pairs were constructed so as to maximize segmental differences in more positions than the ten syllable pairs shown here.
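The Speaker 2 figures can be checked against the difference values printed in Fig. 1(b). The sketch below computes the two mean differences and a Pearson correlation; it assumes that the two rows of values in the figure are paired in the order printed, which the text does not state explicitly. The function name `pearson_r` is illustrative.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Differences (msec) as printed in Fig. 1(b), Speaker 2; pairing by position is assumed.
originals = [324, 269, 153, 92, 88, 83, 69, 68, 51, 37]
mama = [53, 39, 39, -3, 5, -2, 32, 13, 3, -5]

print(round(mean(originals)), round(mean(mama)))  # 123 17
print(round(pearson_r(originals, mama), 2))       # 0.82
```

Under that pairing assumption the values agree with the reported means of 123 and 17 ms and with r = 0.82.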

A typical set of durational profiles for the utterance pairs in this experiment is given in Fig. 2. Figure 2 shows the durations of syllables in the target utterances "This bit may hurt our pretty colt," and "This cold may freeze our grungy seeds," and the corresponding reiterant imitation for both speakers. For comparison with these minimal segmental effects in the "mama" imitations, note the rather large effect obtained (in a different experiment) merely by shifting a secondary stress over by one syllable (Fig. 3). Figure 3 shows "mama" imitations of three target sentences: (1) "Cunning scholars deciphered the tablets," (2) "Thirteen teachers were furloughed in August," and (3) "Intense actors are bothered by coffee." Note that the duration pattern is changed radically by the pattern of stress: "cunning" and "thirteen" have greater stress on the first syllable, whereas "intense" has greater stress on the second syllable.

[Figure 2: line plots in two columns, ORIGINALS and "MAMA" IMITATIONS, for Speaker 1 and Speaker 2; horizontal axis: SYLLABLE POSITION (1-8); vertical axis ticks from 100 to 600.]

FIG. 2. Average duration patterns of two original ... sentences and the average duration patterns of the two ...

[Figure 3: line plot; horizontal axis: SYLLABLE POSITION (1-10); vertical axis ticks from 100 to 300.]

FIG. 3. "Mama" duration patterns for three sentences. The first two sentences differ prosodically from the third with respect to stress placement on the first word. Solid line: "Cunning scholars deciphered the tablets." Dot dashed line: "Thirteen teachers were furloughed in August." Dashed line: "Intense actors are bothered by coffee."

III. DISCUSSION

We conclude that the effects of segmental variation in the target utterance on durations of the "mama" imitations are generally quite small. However, we suggest that those wishing to use this technique should test each of their speakers for the existence and magnitude of such segmental influences, and should exercise some care in interpreting "mama" differences in cases in which there are very large segmentally influenced durational differences in the target utterances. With these caveats, we feel it is fair to proceed under the assumption that reiterant speech yields a clear view of prosodic patterns.

Reiterant speech appears to be a powerful tool for determining how stress and constituent structure affect observed durational patterns not only in a qualitative way, but also in a precise quantitative fashion. In addition, the method is ideally suited for studying perceptual effects of prosodic variables. The nemesis of many such perceptual studies is that the meaning of a word or group of words may override significant but less salient factors in the subject's response. Nakatani and Schaffer (1978) have used the reiterant speech technique to isolate some prosodic cues that influence stress perception. Thus, reiterant speech has promise for describing prosodic influences on both production and perception.

Atal, B. S., and Rabiner, L. R. (1976). "A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition," IEEE Trans. Acoust. Speech Signal Process. 24, 201-212.

Erickson, Y., and Rapp, K. (1973). Cited in B. Lindblom and K. Rapp, Publication No. 21, Institute of Linguistics, University of Stockholm (1973).

Klatt, D. H. (1976). "Linguistic uses of segmental duration in English," J. Acoust. Soc. Am. 59, 1208-1221.

Lehiste, I. (1975). "Some factors affecting the duration of syllabic nuclei in English," Proceedings of the First Salzburg Conference on Linguistics, edited by G. Drachman (Verlag Gunter Narr), pp. 81-104.

Lindblom, B. (1968). "Temporal organization of syllable production," Speech Transmission Laboratory Q. Progr. Status Rep., Stockholm, Sweden, R. Inst. Technol. 2(No. 3), 1.

Nakatani, L. H. (1977). "Computer-aided signal handling for speech research," J. Acoust. Soc. Am. 61, 1057-1062.

Nakatani, L. H., and Schaffer, J. A. (1978). "Hearing words without words: Prosodic cues for word perception," J. Acoust. Soc. Am. 63, 234-246.

Oller, D. K. (1973). "The effect of position in utterance on speech segment duration in English," J. Acoust. Soc. Am. 54, 1235-1247.
