Dept. for Speech, Music and Hearing Quarterly Progress and Status Report Notes on stress and word accent in Swedish Fant, G. and Kruckenberg, A. journal: STL-QPSR volume: 35 number: 2-3 year: 1994 pages: 125-144 http://www.speech.kth.se/qpsr


STL-QPSR 2-3/1994

Notes on stress and word accent in Swedish¹

Gunnar Fant and Anita Kruckenberg

Abstract

This paper provides a review of our earlier work on the realisation of stress and word accents in Swedish, which serves as a frame for introducing voice source parameters in prosodic analysis. Duration and Fo measures have been related to continuously scaled estimates of perceived prominence. Intensity variations appear to be less distinctive and are largely influenced by the overall breathgroup pattern. Increasing stress provides a moderate overall intensity increase combined with an additional high frequency gain associated with a more effective voice source. Special attention is devoted to the covariation of the voice source excitation amplitude Ee with Fo, which displays a maximum at an Fo in the subject's mid register. This has consequences for the realisation of focal accent. A consistent correlate to increasing emphasis is the increasing contrast in vowel-consonant spectrum shape and intensity, which in part is related to varying degrees of articulation-voice source interaction and also affects narrowly articulated vowels. These phenomena suggest the presence of a production oriented component in stress perception.

Introduction

It is well known that stress and prominence are signalled by many phonetic parameters, the relative importance of which varies with the particular language and context. In addition to the three major prosodic parameters Fo, duration and intensity, we may observe systematic expansions and reductions of segmental patterns and spectral contrasts. In recent years much effort has been devoted to studies of the human voice source. A pertinent question is whether we would gain a deeper insight into prosodic patterns by including voice source parameters in the arsenal of descriptive parameters. A part of our research has recently been directed to this area, with studies of voice source correlates to both prominence and grouping, the combined effects of which are especially apparent in phrase and sentence final positions.

It is the purpose of the present article to review some of our previous work on duration and Fo related to perceived relative prominence and more recent work on voice source parameters and intensity in a prosodic frame. The data derives from the reading of Swedish prose and supplementary test sentences. We have referred our data to discrete prosodic categories, such as conventional levels of stress and prominence and associated tonal accents, but we have also performed continuous scaling of perceived stress to be related to continuous measures of duration, intensity and Fo.

Special attention has been devoted to interaction phenomena imposed by supraglottal narrowing which reduces the transglottal pressure and, thus, the excitation strength and changes the shape of the voice source. In general, an increase of prominence causes a hyperarticulation towards closure of narrowly articulated voiced consonants and vowels and, on the other extreme, a greater opening of open vowels.

¹ Presented at the International Symposium on Prosody, Sept. 18, 1994, Yokohama, Japan.

As a result, the vowel/consonant intensity contrast increases with relative emphasis. The listener's stress judgement will be guided by perceived strength of articulation rather than by intensity alone. However, a general observation is that duration and Fo are found to play a greater role than relative intensities.

Stress and duration

In Swedish, the alternation between stressed and unstressed syllables produces quasi-rhythmical patterns that are language specific and depend on the particular type of text and on stylistic and individual variations in reading.

The most consistent stress correlate is duration. A stressed syllable is about 100 ms longer than an unstressed syllable of the same number of phonemes, and the duration increases with the number of phonemes. Differences in inherent phoneme durations also enter but tend to be reduced with increasing syllable complexity. In addition, we have to take into account a number of factors that contribute to syllable duration, in the first place prepause and phrase final lengthening, but also grammatical word class, accent type, syllabic word structures, etc. In practice, almost all content words, but also some of the function words, carry a stress. Most of these factors have been included in a model of syllable duration (Fant et al., 1992).

Fig. 1. Duration of stressed and unstressed syllables as a function of the number of phonemes per syllable. To the left, subject AJ in normal and distinct reading mode; to the right, subjects AJ and LN in normal reading.

The backbone of the model is the growth of syllable duration with the number of phonemes, determined separately for stressed and unstressed syllables in non-terminal positions. This is demonstrated in Fig. 1, of which the left diagram illustrates our reference speaker AJ in two speaking modes, normal and distinct. The distinct mode is characterised by a much greater prolongation of stressed than unstressed syllables. The differences between subjects AJ and LN, shown in the right hand diagram, are similar to the differences between the distinct and normal reading of subject AJ. The stability of unstressed syllables in this comparison is apparent. More detailed studies of speaker specific variations in stressed/unstressed durational contrast have been reported in our earlier studies (Fant & Kruckenberg, 1989; Fant et al., 1991b,e).

When quantifying the duration component of stress, we apparently need a normalisation that eliminates differences due to variations in syllable complexity. We have introduced a normalised duration, the syllable duration index Si, which is scaled so as to provide a value of Si=1 for average unstressed and Si=2 for average stressed conditions. The particular value for a syllable of duration T and n phonemes is found by an interpolation or extrapolation

$$S_i = 1 + \frac{T - T_{nu}}{T_{ns} - T_{nu}}$$

where Tnu and Tns are expected average unstressed and stressed values for the particular number of phonemes.
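The scaling of Si (1 for average unstressed, 2 for average stressed, linear in between and beyond) can be sketched in a few lines of Python. This is an illustration only: the function name and the reference durations of 180 ms and 280 ms are hypothetical and are not taken from the paper's data.

```python
def syllable_duration_index(t_ms, t_unstressed_ms, t_stressed_ms):
    """Linear inter/extrapolation: returns 1 at Tnu and 2 at Tns."""
    if t_stressed_ms == t_unstressed_ms:
        raise ValueError("stressed and unstressed references must differ")
    return 1.0 + (t_ms - t_unstressed_ms) / (t_stressed_ms - t_unstressed_ms)


# Hypothetical reference durations for syllables of a given phoneme count.
T_NU = 180.0  # assumed average unstressed duration, ms
T_NS = 280.0  # assumed average stressed duration, ms

for duration in (180.0, 230.0, 280.0, 330.0):
    print(duration, syllable_duration_index(duration, T_NU, T_NS))
```

A syllable longer than the stressed average extrapolates to Si above 2, which is how strongly emphasised syllables would be represented on this scale.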

Stress induced lengthening affects consonants as well as vowels (Fant & Kruckenberg, 1989; Fant et al., 1991a,b,c; Carlson & Granström, 1986). A tendency is that consonants following a stressed vowel are prolonged more than preceding consonants. In Swedish, the phonological distinction V:C versus VC:, i.e., between long vowel plus short consonant and short vowel plus long consonant, is lost in unstressed positions. Lexically long and short vowels attain durations close to those of unstressed short vowels (Fant & Kruckenberg, 1989) and formant pattern contrasts are reduced (Nord, 1986).

Durations of stress groups, i.e., of units of a stressed syllable and the following unstressed syllables, are mainly predictable from values of expected syllable durations with due consideration of phrase grouping modifications (Fant & Kruckenberg, 1989; Fant et al., 1991d). Isochrony tendencies, if present, are small. On the other hand, there are indications of multipeak distributions of pause durations, which can be interpreted as a tendency of a quantised timing: the sum of pause duration plus prepause lengthening tends to adhere to an integer multiple of the sentence average stress group duration (Fant & Kruckenberg, 1989; Fant et al., 1990, 1991d).

Perceptual scaling and duration

A continuous rating of perceived stress was established from experiments (Fant & Kruckenberg, 1989) in which 14 subjects in two different sessions were asked to grade the relative prominence of syllables and words by making a pencil mark on a vertical line scaled from 0 to 30. They were told that 10 corresponded to average unstressed conditions and 20 to average stressed conditions, i.e., a direct parallel to the Si scale. The experiment gave quite consistent results with standard deviations of single ratings of the order of 3 units only. The same technique was also used in continuous grading of perceived degree of prominence of syntactic boundaries (Fant & Kruckenberg, 1989).

Fig. 2. Perceived word stress Rw in a sentence as a function of the syllable duration index Si of the main syllable.

The corpus graded consisted of all syllables in a 24-word-sentence and all words in nine sentences of the standard text. A close relation was found between word response Rw and that of the dominating syllable of the word, Rs.

We found a high degree of correlation between perceived syllable prominence Rs and the syllable duration index Si

or in terms of a power function

which indicates a compression compared to the duration data. However, it is unclear to what extent this is a true psychoacoustic relation and to what extent it is caused by uncertainties and data averaging. An example of subjective word response Rw versus syllable duration index Si for a sentence appears in Fig. 2.

An analysis in terms of word class (Fant & Kruckenberg, 1989; Fant et al., 1991a) gave a range of Rw variations with a maximum of 18 units for adjectives, followed by nouns, numerals, verbs, adverbs, pronouns, prepositions, auxiliary verbs, conjunctions and, at the low end, articles scoring 9 units only. The corresponding syllable duration index values ranged from Si=2.2 to 0.9. The main corpus of content words scored Si values greater than 1.7 and Rw greater than 14, while function words scored Si smaller than 1.3 and Rw smaller than 13.

Fo correlates of perceived stress

The Fo contour contributes to the relative prominence of syllables and words in terms of local word accent modulations and overlaid Fo components of focal (sentence) or emphatic stress. The concept of word accent is related to the presence of a significant Fo modulation in stressed syllables. Unstressed and the weakest forms of stressed syllables do not display a tonal accent.

The two Swedish word accents, accent 1, also referred to as acute, and accent 2, referred to as grave, are illustrated in Fig. 3. The top part, which derives from Gösta Bruce (1977), pertains to standard Swedish as spoken in Stockholm. The partially overlapping domains of word accent, sentence accent and terminal fall are indicated. Accent 1 is characterised by an HL* (high low-star) fall from an H in a preceding syllable, if present, to the L* in the beginning of the stressed syllable. Focal accent adds a final H to the Fo contour. Accent 2 is characterised by an H*L (high-star low) fall starting in the early part of the stressed vowel and continuing within the voiced part of the stressed syllable. As in accent 1, a focal accent contributes a following H rise before the terminal fall. The complete notation for focal accent is thus HL*H for accent 1 and H*LH for accent 2.

[Figure 3: panels for accent 1 and accent 2, marking the word accent fall, the sentence accent rise and the terminal juncture fall.]

Fig. 3. Stylised representation of Swedish word accents, at the top according to Bruce (1977) and at the bottom from our prose reading data at four levels of perceived word stress.

We have processed data from readings of our standard prose text to derive a number of Fo parameters suitable as correlates to perceived stress. An initial study involving 11 males and 4 females reading a paragraph provided average accent 1 and accent 2 data (Fant & Kruckenberg, 1989). As primary measures we chose the L*H rise within the stressed syllable for accent 1, and the H*L fall within the stressed syllable for accent 2. Our choice of a logarithmic Fo scale turned out to provide comparable average values for males and females: a 2 semitone L*H rise for accent 1 and a 5 semitone H*L fall for accent 2.
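Why a logarithmic scale makes male and female data comparable can be seen from the standard definition of the semitone, 12 times the base-2 logarithm of the frequency ratio. The sketch below uses hypothetical Fo values, chosen only so that both voices change by the same ratio; they are not measurements from the study.

```python
import math


def semitones(f_from_hz, f_to_hz):
    """Signed interval in semitones between two frequencies."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)


# Hypothetical rises with the same ratio at a male and a female pitch level.
male_rise = semitones(100.0, 110.0)
female_rise = semitones(200.0, 220.0)

print(round(male_rise, 2), round(female_rise, 2))  # equal on the log scale
print(semitones(200.0, 100.0))  # an octave fall gives -12.0
```

In Hz the two rises differ by a factor of two (10 Hz versus 20 Hz), while in semitones they coincide.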

In another and more detailed test with our reference subject as a speaker we performed a statistical analysis of the correlation of perceived stress with Fo measures within the Bruce notation system. The results are included in Fig. 3 as stylised Fo contours for four levels of perceived prominence, Rw = 14, 16.5, 18.5 and 20 units, corresponding to weak, average, high and extra high stress level. There is an overall agreement with the Bruce prototypes. For accent 2, we adopted the magnitude of the H*L drop as the major stress correlate. At the highest stress level we observed a drop of almost an octave followed by a rise to a secondary peak H in the next or a subsequent syllable. The magnitude of the secondary peak is here of the same order as the primary H*, but in emphatic stress it can be even greater. In general, the parts of the Fo drop above and below the local mean Fo reference are approximately of equal size.

The accent 1 Fo modulation shows a quite different outline. Here the HL* fall decreases with increasing stress level. This fall is highly variable and may be completely eliminated at high stress levels by the focal accent lifting not only the following H, but also the L*, to a level above the initial H. This we have found to be typical of iambic verse reading (Kruckenberg & Fant, 1993). It is apparent that the initial HL* drop, although a typical feature of accent 1, is not useful as a quantified stress correlate (Fant & Kruckenberg, 1993), and it is in any case not available when a sentence starts with a monosyllabic accent 1 word. The L*H rise is a better stress correlate of accent 1, but its significance level is lower than that of the main correlate of accent 2, which is the H*L fall. The final L*H of accent 1 and the LH of accent 2 apparently serve the dual function of signalling the degree of stress and, if large enough, the presence of focal stress.

A simplified conclusion is that as far as Fo stress correlates are concerned, accent 1 relies on the sentence stress component only, while accent 2 has both a significant word accent component and a sentence stress component. In some sense there is therefore a support for the unmarked character of accent 1 as proposed by Elert (1968) and by Engstrand (1989). Another conclusion supported by Fig. 3 and by our experience from speech analysis is the need to introduce a continuous measure of stress. We have, thus, quantified the accent 1 Fo modulation by the relative size of the final L*H rise irrespective of the focal/nonfocal decision which is both subjective and uncertain. As already stated, the Fo-component typical of focal (sentence accent) is also present at lower levels of accentuation but reduced.

An accent 2 stress correlate intimately related to the accent 2 H*L fall is maximum steepness. On the average it was found to be of less significance than the magnitude of the fall. However, in selected paragraphs it could reach a relatively high ranking.

We have looked into the interaction between duration and Fo as stress cues. A conclusion is that the correlation between duration and Fo is high in accent 2 but lower in accent 1. Duration and Fo are equally good predictors of stress in accent 2, where they have a high mutual correlation. In accent 1, Fo has less predictive power than duration but has an additive complementary function which is more apparent than in accent 2.

One common feature of the two word accents is that the coefficients relating a change in syllable duration index Si or in Fo to a change in perceived stress are almost the same in accent 1 and accent 2. We may, thus, state that an Fo modulation of 1 semitone, i.e. 6%, is equivalent to a duration increase of 20 ms and an increase of Rw=0.8 units along our perceptual stress scale, the latter approaching the level of uncertainty of stress judgements. These duration, pitch and prominence quanta are thus close to a perceptual difference limen and they cover about one sixth of the distance between average unstressed and average stressed conditions.
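The equivalences stated above (1 semitone, 20 ms and 0.8 Rw units) can be written as simple conversion factors. The sketch below assumes that the relation stays linear for small changes, which is an assumption made here for illustration rather than a claim of the paper; all names are hypothetical.

```python
RW_PER_SEMITONE = 0.8        # perceptual units per semitone of Fo modulation
RW_PER_MS = 0.8 / 20.0       # perceptual units per ms of added duration


def stress_change_from_f0(semitone_change):
    """Rw change equivalent to an Fo modulation change (linear assumption)."""
    return RW_PER_SEMITONE * semitone_change


def stress_change_from_duration(duration_change_ms):
    """Rw change equivalent to a duration change (linear assumption)."""
    return RW_PER_MS * duration_change_ms


def semitone_as_percent():
    """Frequency increase, in percent, corresponding to one semitone."""
    return (2.0 ** (1.0 / 12.0) - 1.0) * 100.0


print(stress_change_from_f0(2.0), stress_change_from_duration(40.0))
print(round(semitone_as_percent(), 1))  # close to the 6% quoted in the text
```

Under this reading, a 2 semitone increase in Fo modulation and a 40 ms lengthening map to the same change in perceived stress.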

Extended acoustic analysis. Source features added.

Fig. 4 provides an overview of the covariation of stress related phenomena within a Swedish sentence, "Han hade legat och skrivit det i en stor sal" ("He had been lying writing it in a big hall"); see also Fant (1993) and Fant & Kruckenberg (1993). The top line is the continuously scaled perceived stress value, Rw, of each word. The three functions in the middle pertain to different methods of deriving the voice source excitation amplitude, Ee. Curve A represents the maximum negative going amplitude from a complete continuous inverse filtering. Curve B, the middle one, is an approximation with the inverse filter tuned to the neutral vowel. Curve C, below, is the negative part of the unfiltered speech wave. It is remarkable that they all show approximately the same contour. One difference, however, is the tendency of the approximate inverse filtering, and even more so the oscillogram contour, to provide a few dB too low amplitude in low F1 vowels. A conclusion is that, at least for a male voice, we may estimate the source strength directly from the oscillogram, which of course requires a check of the polarity. But this is a matter of experience.
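The oscillogram-based estimate (curve C) amounts to tracking the largest negative excursion of the waveform over time. A minimal sketch follows, assuming non-overlapping frames and a dB scale relative to an arbitrary reference; the paper does not specify its procedure at this level, so the function name, framing and toy signal are hypothetical.

```python
import math


def negative_peak_contour_db(samples, frame_len, ref=1.0):
    """Per-frame level (dB re `ref`) of the largest negative excursion."""
    contour = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        peak = max(0.0, -min(frame))  # magnitude of the most negative sample
        contour.append(20.0 * math.log10(peak / ref) if peak > 0 else float("-inf"))
    return contour


# Toy waveform: two frames of four samples each.
wave = [0.0, -0.5, 0.2, 0.1, 0.0, -1.0, 0.3, 0.0]
print(negative_peak_contour_db(wave, 4))

# With inverted polarity the contour changes, hence the polarity check.
print(negative_peak_contour_db([-x for x in wave], 4))
```

The second call shows why the polarity of the recording has to be verified first: the same signal with inverted sign yields a different contour.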

Below these source functions are overall speech intensity contours produced with lowpass 1000 Hz and highpass 1000 Hz prefiltering, and at the bottom we find the Fo contour on a log scale. The lowpass channel carries a major part of the overall speech intensity and its contour largely follows that of the source amplitude Ee.

An observation of general significance is the gradual decline of intensity and source amplitude within the utterance which is a common feature of a breath group. To some extent the intensity contour follows and supports the Fo declination. The utterance displayed here ends with a continuation tone. A more dramatic intensity drop is usually found at the end of a completed sentence leading up to a semantic break in which case the prepause lengthening is also reduced. Our earlier studies of perceived syllable prominence versus duration (Fant & Kruckenberg, 1989) suggest that final lengthening in part is anticipated by the listener.

Fig. 4. Assembly of speech analysis data from the same sentence as in Fig. 2. From the top: subjective word response Rw, oscillogram, spectrogram, voice excitation amplitude Ee (in negative direction) derived from A, proper inverse filtering, B, neutral vowel inverse filtering, and C, the negative of the speech oscillogram, followed by LP 1000 Hz and HP 1000 Hz intensity and Fo.

The trading relation between duration and intensity as stress correlates needs to be studied in more detail, e.g., with respect to differences in phrase terminal and medial locations.

There are several observations of general interest to be made from Fig. 4. As can be seen from the top line, the major stresses lie in the verb "skrivit" and in the adjective plus noun, "stor sal", at the end of the utterance. Their associated duration and Fo correlates are apparent, e.g., the prominent accent 2 Fo fall in "skrivit" and the accent 1 Fo rise in "stor" and "sal". However, in all three words, vowel intensity and source amplitude are below that of the unstressed vowel [a] at the beginning of the utterance, which is explained by the overall phrase group decline of voice intensity.

The low voice source excitation amplitude Ee in the long stressed vowels [i:] in "skrivit" and [u:] in "stor" is mainly caused by articulatory interaction reducing the transglottal pressure drop. These vowels are produced with a gesture towards a homorganic voiced consonant, which is reflected in the Ee profile. Here, the normal positive relation between stress and intensity is reversed. The dynamic gesture towards closure is a strong segmental cue suggesting a motor component of stress perception related to the underlying articulatory effort. Fig. 4 also provides an example of reduction and fusion, i.e. the consonant [h] in the function word "hade", which lacks a durational segment of its own. The voicing of intervocalic [h] is much reduced in sentence focal position, as illustrated by Fig. 5.

Fig. 5. Oscillogram of the word "behålla" and the associated Ee contour in focal position (left) and in prefocal position (right). Time axis in ms.

These examples support the general finding of articulation-voice source interaction in connected speech. Deemphasis causes a loss of intensity and spectral contrast between vowels and consonants; emphasis results in a sharpening of the contrast. This is a matter of contrasts both in formant patterns (Nord, 1986) and in source characteristics. The extent of these variations is speaker and speaking style specific (Fant et al., 1991).

Voice source parameters

The intensity of a voiced sound or of a single formant is a complex function of source characteristics and of the entire transfer function specified in terms of frequencies and bandwidths of poles and zeros. The basic source scale factor Ee is defined as the derivative of glottal flow at the instant of the glottal closing discontinuity. It has recently been shown (Fant et al., 1994; Fant & Liljencrants, 1994; Fant & Kruckenberg, 1994) that essentials of the source wave form and spectrum can be predicted from a knowledge of the glottal flow maximum amplitude, Uo (above a possible constant leakage flow), combined with Ee and Fo into a normalised shape parameter

The simplified procedures for extraction of Ee demonstrated in Fig. 4 also hold for Uo (see Fig. 6), which pertains to the same sentence.

Fig. 6. Contours of glottal flow, Uo(t) (without DC component), within the same utterance as in Fig. 4, derived from proper inverse filtering, neutral vowel inverse filtering and integration only.

The differences between complete inverse filtering, neutral vowel inverse filtering and a simple integration of the speech wave are not very great as far as this male subject is concerned. A conclusion is that major aspects of the temporal dynamics of voice source amplitude and wave shape variations may be inferred without any inverse filtering, but more experience is needed.

As further outlined in Fant et al. (1994), default values of LF shape parameters representing normal covariation are unique functions of the Rd parameter.

The voice source cutoff frequency Fa is by definition

$$F_a = \frac{F_0}{2\pi R_a}$$

and approximately

Fig. 8. Ee versus Fo (fundamental frequency in Hz) of a male subject GF and a female subject AK, sampled from the accentuation sentences.

As shown in Figs. 9-10, the turning points of Ee versus Fo are also apparent dynamically in connection with focal accentuation. This is especially clear in the female utterance, Fig. 9, where the dominating Fo peak of the focal accent is associated with an intensity minimum and local maxima of intensity appear at Fo=215 Hz, both in the ascending and in the descending branch. The maxima are not of the same size, suggesting a glottal gesture superimposed on a decaying subglottal pressure starting from an initial high value in the early part of the accent 1 focal area. The same general dynamic behaviour, but weaker, appears for the male subject (Fig. 10). Here the critical frequency appears, as expected, around Fo=130 Hz.

The focal Fo peak appears within the vowel [a:] of the word "Len'ar" and at the end of the vowel [e:] in "L'enar". A common feature is that the start of the Fo focal rise is located at the onset of the stressed vowel in synchrony with the pressure pulse. Out of focus, on the other hand, accent 1 syllables display a minimum of Fo. These observations conform with the earlier described patterns from prose reading (Fig. 4). A well established feature apparent in our illustrations from the test sentences (Figs. 9-11) is a relative attenuation of Fo accent modulations before and after focus (Bruce, 1977; Gårding, 1981; Horne & Johansson, 1991).

The overall tendency of Ee versus Fo proportionality is especially apparent in the upper half of Fig. 11, male subject GF. Here and in the lower part of Fig. 11 the main content words "Maria" and "Lenar" are out of focus, but there is a tendency of sentence accent with rising Fo on the final word [i'jen] (again).

As expected, there is a clear contrast in vowel durations by a factor of about 2 comparing the long stressed [e:] of "L'enar" and the short unstressed vowel [e] of "Len'ar" in focal positions (Figs. 9 and 10), but there is a less pronounced difference comparing the long stressed vowel [a:] of "Len'ar" with the short unstressed vowel [a] of "L'enar", which may depend on final lengthening. In nonfocal positions (Fig. 11), the durational stressed/unstressed contrast in the vowel [a] is more apparent. This is the most obvious durational distinction in the accent study.

Fig. 9. Examples of test sentences from our word accent study, female subject AK of Fig. 8. Focal accent with Fo extending above that of maximum Ee at Fo=215 Hz causes a local decrease of intensity as seen in the oscillogram and in the intensity curves, the upper one processed with flat weighting and the lower one with high frequency preemphasis.

Fig. 11. Two sentences, subject GF, without intended focal accentuation. The final Fo increase can be regarded as a default sentence accent and a continuation tone. Observe the close conformity between Fo and the superimposed Ee(t) contour in the upper illustration.

The presence of focus affects the duration of the test word "Lenar" only marginally. On the other hand, as seen in the lower part of Fig. 10, a focus on the initial word "Maria" enhances its duration appreciably compared to Fig. 11. The focal accent enhances the H*L Fo fall on [i:(j)] as well as the following H rise on the vowel [a] towards a maximum of the sentence intonation. The focal accent also enhances the depth of the Ee minimum and the associated intensity minimum in the maximally closed phase (j). The depth of this minimum, typical of articulatory interaction, is larger than what can be predicted from Fo alone. The same is true of all Ee and intensity minima of the consonants [r] and [l] in these illustrations.

Another observation within the focal accent 2 word "Maria" in Fig. 10 is the typical inverse relation between Ee and Fo associated with the higher than optimal Fo already noted for accent 1. The final H of "Maria" also supports the initial H* of the following accent 2 verb "lenar". This is a common aspect of accent coarticulation which is difficult to decompose into its components. Another example is an accent 1 word following an accent 2 word, in which case an Fo peak on the second syllable of the accent 2 word may function as the H starting point of the HL* accent 1 fall (Bruce, 1977).

Intensity as a prosodic parameter

Intensity appears to play a subordinate role when comparing nonfocal stressed and unstressed syllables. The average difference is of the order of 2 dB only. The additional intensity increase in focal stress is of the order of 0-6 dB and highly dependent on the speaker. It was larger for subject AK than for subject GF. However, as already demonstrated, the intensity increase is locally reduced within the focal domain by an Fo higher than the optimal value. As already shown in connection with Fig. 4 and in Figs. 9-11, an alternative source of intensity decline to consider is a transition towards a more constricted vocal tract, which affects both the source and the transfer function.

An important covarying factor to intensity variations is the source spectrum, which usually attains a high frequency emphasis with increasing scale factor Ee. It can be quantified by the cutoff frequency Fa, above which the source spectrum attains an extra -6 dB per octave slope (Fant et al., 1985). A normal value for a male voice is Fa = 700 Hz and for a female voice 500 Hz. An increase of Ee by 6 dB is associated with an increase of Fa of the order of 50%. This nonlinearity was already noted in the early work of Fant (1959), who reported that an increase of overall voice intensity raising the F1 amplitude by 10 dB was associated with a 4 dB change in the voice fundamental amplitude and 14-18 dB in F2 and higher formants.
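The effect of Fa on the spectrum can be illustrated by treating the extra slope as a first-order low-pass with corner frequency Fa, which falls off by about 6 dB per octave well above Fa. This first-order form, and the multiplicative scaling of Fa by 50% per 6 dB of Ee, are assumptions made here for illustration; the function names are hypothetical.

```python
import math


def extra_tilt_db(freq_hz, fa_hz):
    """Extra source attenuation (dB) of an assumed first-order low-pass at Fa."""
    return -10.0 * math.log10(1.0 + (freq_hz / fa_hz) ** 2)


def fa_after_ee_change(fa_hz, ee_change_db):
    """Fa after an Ee change, assuming +50% per 6 dB applied multiplicatively."""
    return fa_hz * 1.5 ** (ee_change_db / 6.0)


fa_normal = 700.0                                # male value quoted in the text
fa_raised = fa_after_ee_change(fa_normal, 6.0)   # Ee raised by 6 dB

# High frequency gain at 3000 Hz that comes from the raised Fa alone.
gain = extra_tilt_db(3000.0, fa_raised) - extra_tilt_db(3000.0, fa_normal)
print(fa_raised, round(gain, 2))
```

Under these assumptions the higher formants gain a few dB on top of the overall level increase, which is the direction of the nonlinearity described above.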

As a rule of thumb, a 2 dB increase in overall intensity is associated with about 3 dB increase in a measure with high frequency weighting. However, the nonlinearity may occasionally be greater. An example in our data is the vowel [a] of the word "Len'ar" (subject AK, in the lower part of Fig. 9), which is focally stressed. The difference between the preemphasis and flat weighting is here 5-10 dB higher than in the unstressed [a] of "L'enar" in the upper part of Fig. 9.

An apparent aspect of intensity dynamics in our data is the tendency of overall declination within a phrase and at phrase boundaries. In a study of a whole paragraph containing 9 complete sentences and 23 major groups separated by pause breaks we observed an average intensity downdrift of 9 dB from the highest value to a value sampled in the middle of the vowel of the final syllable of the group. Unstressed final syllables were found to be 2 dB below unstressed averages if preceding a continuation juncture and 7 dB weaker if preceding a complete sentence, in which case the prepause lengthening was reduced. As expected, these intensity contours largely follow the Fo contours. Prepause lengthening at continuation junctures was of the order of 50-150 ms for stressed syllables and 40-100 ms for unstressed syllables. Final lengthening is accordingly compensated by final intensity decline.

A summary view of stress correlates

In our view, duration is the most consistent physical correlate of stress. Next, or of equal importance, is the Fo pattern of the associated word accent. However, we find examples of weak stress realised by a significant duration increase but without Fo modulation. On the other hand, there are also examples of focal accent with appreciable Fo modulation but without appreciable duration or intensity increase. A typical feature is the inverse relation between Ee and Fo above the subject's optimal Fo.

As a rule, stress, like any other prosodic or segmental category, is realised by a set of coordinated phonatory and articulatory gestures which may be simple to conceptualise but difficult to observe and quantify. The corresponding speech wave correlates, on the other hand, are technically easy to portray but are highly distributed and interrelated and can be complicated to handle. However, once the production mechanisms are known, the one-to-many relations between production and acoustic parameters can be exploited in the search for acoustic correlates.

The systematic covariation of vowel and consonant durations within a syllable makes it possible to base durational studies on VC units as an alternative to complete syllables. The duration of a stressed syllable also affects the duration of the associated stress group (foot) which, thus, reflects the stress pattern. Fo correlates of stress are also distributed, e.g., the covariation of accent 2 H*L fall height and the following LH rise, typical of focal accentuation.

An overall impression derived from speech analysis is that the concept of focal stress is often elusive and that there is a need for a continuous scaling of stress or prominence supplementing the phonological discrete stress levels. Our perceptual scaling of a limited speech material in terms of continuous direct estimates has been correlated to duration and Fo modulation depth of word accents. The outcome is that an increase of 0.8 units along our Rw perceptual interval scale covaries with 1 semitone, i.e. 6%, on the Fo scale, and a syllable duration increment of 20 ms. These may be regarded as perceptual difference limens.
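The figures quoted above can be turned into a small worked example. One semitone corresponds to an Fo ratio of 2^(1/12), i.e. an increase of just under 6%. The Python sketch below converts semitones to percent and scales the quoted covariation to other increments of Rw. The linear scaling is an assumption that is not stated in the text, and the function and constant names are hypothetical.

```python
RW_UNITS_PER_STEP = 0.8       # Rw increase quoted in the text
SEMITONES_PER_STEP = 1.0      # covarying increase in Fo modulation depth
DURATION_MS_PER_STEP = 20.0   # covarying increase in syllable duration


def semitones_to_percent(semitones: float) -> float:
    """Relative Fo change in percent for an interval given in semitones."""
    return (2.0 ** (semitones / 12.0) - 1.0) * 100.0


def covarying_changes(delta_rw: float) -> dict:
    """Fo and duration increments that would accompany a change delta_rw.

    Assumption: the covariation is linear over the range of interest.
    """
    steps = delta_rw / RW_UNITS_PER_STEP
    semitones = steps * SEMITONES_PER_STEP
    return {
        "fo_semitones": semitones,
        "fo_percent": semitones_to_percent(semitones),
        "duration_ms": steps * DURATION_MS_PER_STEP,
    }


one_step = covarying_changes(0.8)
```

For example, `covarying_changes(0.8)` reproduces the quoted step of 1 semitone and 20 ms, with the Fo change expressed as about 5.9%.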

Intensity is not a consistent stress correlate but is primarily of importance at higher stress levels and for signalling overall voice effort, loudness and a specific breathgroup contour. Intensity is proportional to the voice source scale factor Ee and also increases with Fo (about 3 dB per octave increase of Fo). A stress-controlled source feature is the spectrum slope, which controls the relative balance between high- and low-frequency parts. An increase of stress or overall loudness is usually accompanied by a high-frequency emphasis.
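The dependence of intensity on the scale factor and on Fo can be expressed as a simple dB sum. The sketch below assumes that a change of the scale factor in dB carries over one-to-one to intensity and that the Fo contribution is exactly 3 dB per octave. It ignores formant and spectral slope effects, and the function name is hypothetical.

```python
import math


def intensity_change_db(delta_ee_db: float, fo_from_hz: float, fo_to_hz: float,
                        db_per_octave: float = 3.0) -> float:
    """Approximate intensity change (dB) from a scale-factor change and an Fo shift.

    Assumption: the two contributions simply add in the dB domain.
    """
    octaves = math.log2(fo_to_hz / fo_from_hz)
    return delta_ee_db + db_per_octave * octaves


# Fo raised by one octave with an unchanged scale factor.
octave_rise = intensity_change_db(0.0, 100.0, 200.0)
```

With an unchanged scale factor, raising Fo by one octave gives the quoted 3 dB, and lowering it by one octave gives -3 dB.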

Articulatory interaction has an important role in the overall system. Emphasis (hyperarticulation) causes overshoot of articulatory targets, and deemphasis (hypoarticulation) produces undershoot and reductions. A result of emphasis is that constricted articulations become more constricted and open articulations more open, thus expanding the contrast between voiced consonants and vowels. This contrast is related not only to the particular transfer functions but also to the voice source, which with increasing degree of constriction becomes attenuated and usually attains an increased spectral slope (Bickley & Stevens, 1986). In Swedish, the close vowels [i:], [y:], [ʉ:], [u:], and to some extent also [e:] and [o:], have an inherently weaker voice source than the more open vowels. This tendency is accentuated by a dynamic narrowing when stressed. Is the higher inherent Fo of these close vowels, providing higher Ee, a compensation for the articulatory interaction?

Glottal articulations follow similar rules. An intervocalic voiced [h] loses its voicing and its noise component gains when stressed, which is a matter of extended abduction (unless it is realised by a glottal stop). The same holds when the stress level is increased for a vowel followed by an unvoiced stop, which causes a breathy preocclusion termination of the vowel, including formant damping, a finite influence of subglottal poles and zeros, and noise generation (Fant & Lin, 1988; Klatt & Klatt, 1990). The situation is in part similar to the breathgroup-final relaxation of the vocal folds.

We still have much to learn about how the coordination of lung pressure and glottal adjustments in synchrony with overall articulatory patterns affects the voice source. It seems probable that chest pulses of subglottal pressure increase appear in focal and higher degrees of stress only. To what extent can a glottal gesture alone contribute to the realisation of stress? One obvious mechanism is the particular Fo modulation, which can be the major component of focal accent. Another is a more abrupt vocal fold return at closure, providing high-frequency emphasis, i.e. an Fa increase. There are data in the literature on the covariation of Fo and subglottal pressure (Ladefoged, 1963; Strik & Boves, 1993), but it remains to find out how lung pressure and glottal adjustments covary in determining Ee(Fo) and Uo(Fo) within a breathgroup, and specifically the characteristic resonance-curve shape of Ee(Fo), with a maximum in the speaker's mid-frequency register. A more general problem is the temporal synchrony of lung pressure, glottal adjustments and articulatory gestures within a linguistic frame. We have some evidence of a coordinated activity to enhance the early part of a stressed vowel and the articulatory contrast at the CV boundary, in other words a P-center marking (Marcus, 1981). How are such phenomena influenced by accent 1 versus accent 2 and by the phonological length of the stressed vowel? These are topics worth looking into.

Acknowledgements

This research has been supported by grants from The Bank of Sweden Tercentenary Foundation, the Esprit Speechmaps program and by a contribution from Telia Promotor Infovox AB. The Swedish Council for Research in the Humanities and Social Sciences (HSFR) has supported some of our earlier work reported here.

References

Bickley, C.C. & Stevens, K.N. (1986). "Effects of a vocal tract constriction on the glottal source: Experimental and modelling studies", Journal of Phonetics 14, pp. 373-382.

Bruce, G. (1977). Swedish Word Accents in a Sentence Perspective, Lund: CWK Gleerup.

Bruce, G. & Granström, B. (1993). "Prosodic modelling in Swedish speech synthesis", Speech Communication 13, pp. 63-73.

Carlson, R. & Granström, B. (1986). "A search for durational rules in a real-speech data base", Phonetica 43, pp. 140-154.

Elert, C.C. (1968). "Allmän och svensk fonetik", Stockholm: Almqvist & Wiksell.

Engstrand, O. (1989). "Phonetic features of the acute and the grave word accents: data from spontaneous speech", PERILUS, University of Stockholm, No. X, pp. 13-37.

Fant, G. (1959). "Acoustic Analysis and Synthesis of Speech with Applications to Swedish", Ericsson Technics No. 1.

Fant, G. (1982). "Preliminaries to analysis of the human voice source", STL-QPSR 4/1982, pp. 1-27.

Fant, G. (1993). "Some problems in voice source analysis", Speech Communication 13, pp. 7-22.

Fant, G. & Kruckenberg, A. (1989). "Preliminaries to the study of Swedish prose reading and reading style", STL-QPSR 2/1989, pp. 1-83.

Fant, G. & Kruckenberg, A. (1993). "Towards an integrated view of stress correlates", Working Papers 41 (Dept. of Linguistics, Lund University), pp. 42-45.

Fant, G. & Kruckenberg, A. (1994). "Voice source parameters in connected speech. A progress report", Working Papers 43, Lund University, Dept. of Linguistics, pp. 58-61.

Fant, G. & Liljencrants, J. (1994). "Data reduction of LF voice source parameters", Working Papers 43, Lund University, Dept. of Linguistics, pp. 62-65.

Fant, G. & Lin, Q. (1988). "Frequency domain interpretation and derivation of glottal flow parameters", STL-QPSR 2-3/1988, pp. 1-21.

Fant, G., Kruckenberg, A. & Nord, L. (1990). "Acoustic correlates of rhythmical structures in text reading", Nordic Prosody V, Turku, pp. 70-86.

Fant, G., Kruckenberg, A. & Nord, L. (1991a). "Temporal organization and rhythm in Swedish", Proceedings of the XIIth ICPhS, Aix-en-Provence, pp. 251-256.

Fant, G., Kruckenberg, A. & Nord, L. (1991b). "Some observations on tempo and speaking style in Swedish text reading", ESCA Workshop on "The phonetics and phonology of speaking styles", Barcelona.

Fant, G., Kruckenberg, A. & Nord, L. (1991c). "Durational correlates of stress in Swedish, French and English", Journal of Phonetics 19, pp. 351-365.

Fant, G., Kruckenberg, A. & Nord, L. (1991d). "Stress patterns and rhythm in the reading of prose and poetry with analogies to music performance", Music, Language, Speech and Brain, Wenner-Gren Center International Symposium Series, Vol. 59 (J. Sundberg, L. Nord, R. Carlson, eds.), pp. 380-407.

Fant, G., Kruckenberg, A. & Nord, L. (1991e). "Prosodic and segmental speaker variations", Speech Communication 10, pp. 521-531.

Fant, G., Kruckenberg, A. & Nord, L. (1992). "Prediction of syllable duration, speech rate and tempo", Proceedings ICSLP 92, Banff, Vol. 1, pp. 667-670.

Fant, G., Liljencrants, J. & Lin, Q. (1985). "A four-parameter model of glottal flow", STL-QPSR 4/1985, pp. 1-13.

Fant, G., Kruckenberg, A., Liljencrants, J. & Båvegård, M. (1994). "Voice source parameters in continuous speech. Transformation of LF-parameters", ICSLP-94, Yokohama.

Gårding, E. (1981). "Contrastive prosody: A model and its application", Studia Linguistica, Vol. 35, No. 1-2, pp. 146-165.

Gårding, E. (1995). "Intonation in Swedish", Working Papers 35, Lund, 1991, pp. 63-88. Contribution submitted to Intonation Systems (A. di Cristo & D. Hirst, eds.), Cambridge University Press.

Horne, M. & Johansson, C. (1991). "Lexical structure and accenting in English and Swedish texts", Working Papers 38, Lund University, Dept. of Linguistics, pp. 97-114.

Klatt, D.H. & Klatt, L.C. (1990). "Analysis, synthesis and perception of voice quality variations among female and male talkers", J. Acoust. Soc. Am., Vol. 87, pp. 820-857.

Kruckenberg, A. & Fant, G. (1993). "Iambic versus trochaic patterns in poetry reading", Nordic Prosody VI, Stockholm, pp. 123-135.

Ladefoged, P. (1963). "Some physiological parameters in speech", Language and Speech, Vol. 6, part 3, pp. 109-119.

Marcus, S.M. (1981). "Acoustic determinants of perceptual center (P-center) location", Perception and Psychophysics 30, pp. 247-256.

Nord, L. (1986). "Acoustic studies of vowel reduction in Swedish", STL-QPSR 4/1986, pp. 19-36.

Strik, H. & Boves, L. (1993). "A physiological model of intonation", Proc. Dept. of Language and Speech, Univ. of Nijmegen, (1992/1993) 16/17, pp. 96-105.