On Vowels
Perception of Spectral Features, Related Aspects of Production and Sociophonetic Dimensions
Hartmut Traunmüller
University of Stockholm, 1983
Academic dissertation for the degree of Doctor of Philosophy
Department of Linguistics (Institutionen för lingvistik)
106 91 Stockholm
Abstract
The first and major part of this thesis deals with spectral features of
vowels and with the distinction of phonetic information from personal
and transmittal information also conveyed to listeners by speech sounds.
The results of perceptual experiments with synthetic vowels whose
fundamental and first formant were varied in frequency suggested that
the smaller tonotopical distances between formants (< 6 Bark) are
invariant in phonetically identical vowels. This was also confirmed by an
analysis of formant frequency data of vowels produced by male and
female speakers of several languages. It is further investigated how
partials are resolved in the process of timbre perception. Previous
experiments by other researchers suggest an effective bandwidth close
to three Bark. In similar experiments, though using different stimuli, this
result could not be replicated. A re-analysis of some other experimental
results gave, among other details, effective bandwidths roughly
proportional to frequency in the range below 600 Hz. Due to contextual effects,
the general validity of this result is in question. The non-uniform
sex differences in formant frequencies are shown mainly to be
consequences of an anatomical development in accord with the perceptual
condition of invariant phonetic qualities.
The second part of the thesis, Vocalism in Eastern Central Bavarian, represents a case study of the realization of sociophonetic dimensions in
speech. In the chosen group of dialects some phonological rules lead to a
richly shadowed vowel system. The application of these rules is
investigated with respect to dialectal, sociolectal, speaker age, and
speech tempo variation.
© 1983 Hartmut Traunmüller
ISBN 91-7146-324-0 minab/gotab, Stockholm, 1983
PREFACE
Doctoral theses usually treat a carefully delimited subject according
to a plan drawn up in advance. I have, however, chosen a different
approach. Starting from an experiment on the perception of vowels, I
followed the ideas and questions that arose in the course of the work.
These ideas led in part in widely different directions. Thus, besides
the main theme, the perception of speech, I also touch on
psychoacoustics, speech production, phonology, and sociolinguistics.
As a result, the treatment of the subject became less thorough in
several respects. I was guided by the conviction that unconstrained
research would lead to a greater scientific yield.
Besides my supervisor, Björn Lindblom, whose valuable comments have
influenced the shape of most of the papers included in the thesis, I
have drawn on ideas and comments from James Lubker (paper 2), Tore
Janson and Astrid Stedje (paper 6), and the anonymous reviewers of
papers 2 and 3.
Johan Sundberg and Björn Lindblom placed their vocal tract data at my
disposal, and Johan Liljencrants introduced me to his LEA simulation
program.
Karin Holmgren, Peter Branderud, and Johan Stark have in various ways
eased my dealings with the computer of the phonetics laboratory.
Richard Schulman and James Lubker checked my English, and Suzanne
Schlyter revised my versions of the summaries in French.
Karin Holmgren, Milly Söderman, and others helped me prepare the
manuscripts.
Gunnar Fant and his colleagues at the Department of Speech
Communication (Institutionen för talöverföring), KTH, have shown a
stimulating interest in my work.
To all of them, and also to those who volunteered as experimental
subjects, I extend my heartfelt thanks!
I also wish to thank my wife, Neeltje, who has put up with my at times
erratic working rhythm and the other strains that my work has
entailed.
I wish to express my gratitude to the anonymous reviewers of papers
2 and 3, whose valuable criticism led to a significant improvement of
these papers.
Stockholm, 1983-04-19
Hartmut Traunmüller
The thesis consists of the present summary and the following papers by
Hartmut Traunmüller:
(1) Analytical expressions for the tonotopical sensory scale,
submitted to Acustica.
(2) Perceptual dimension of openness in vowels, in: The Journal of the
Acoustical Society of America, 69, 1981, pp. 1465-1475.
(3) Articulatory and perceptual factors controlling the age- and
sex-conditioned variability in formant frequencies of vowels, submitted
to Speech Communication.
(4) Perception of timbre: evidence for spectral resolution bandwidth
different from critical band?, in: R. Carlson and B. Granström (eds.),
The Representation of Speech in the Peripheral Auditory System,
Elsevier Biomedical Press, Amsterdam, 1982, pp. 103-108.
(5) Die spektrale Auflösung bei der Wahrnehmung der Klangfarbe von
Vokalen, submitted to Acustica.
(6) Der Vokalismus im Ostmittelbairischen, in: Zeitschrift für
Dialektologie und Linguistik, 49, 1982, pp. 289-333.
Contents:
1 Spectral features of vowels
1.1 The tonotopical sensory scale
1.2 The acoustic-to-phonetic transformation
1.3 Spectral resolution in timbre perception
1.4 Acoustical consequences of vocal tract growth
2 Vocalism in Eastern Central Bavarian
1 SPECTRAL FEATURES OF VOWELS
1.1 The tonotopical sensory scale
(paper 1)
One of the principal features of the sense of hearing is the
frequency-to-place transformation realized in the cochlea. The resulting
tonotopical order is known to be maintained also at higher levels in
the brain. A psychoacoustical measure of the tonotopical dimension is
known as "critical-band rate" or "tonality" ("Tonheit") z. Its unit,
one Bark, is equal to the fundamental bandwidth of resolution
evidenced in loudness summation and in masking experiments. One Bark
corresponds approximately to 25 steps in just noticeable frequency
differences and to 150 haircells and 1.3 mm along the basilar membrane.
The Bark-scale, see Figure 1, has been published by Zwicker (1960) in
the form of a table. Several analytical expressions which approximate
this function have been published. In the course of the investigations
summarized below, the published equations were found not to be
accurate enough in some applications in which either the tonotopical
distance between formants was to be assessed or the formants of
synthetic vowels were to be shifted a certain distance along the
tonotopical scale. In addition, some of the equations were not simple
to invert. Therefore a new attempt was made with a rigorous demand for
accuracy within the range of vowel formant frequencies up to
children's F4. The obtained expression compares with the most accurate
ones previously known as follows:
[Figure 1: critical-band rate z (0 to 24 Bark) plotted against frequency f (0.1 to 20 kHz, logarithmic axis); the curve is labelled z/Bark = 26.81/(1 + 1960 Hz/f) - 0.53.]
Figure 1. Critical-band rate z as a function of frequency f. Data
points from Zwicker (1960). Frequency f scaled logarithmically. The
curve corresponds to equation (4).
Equation and inverse, with deviation from the table (f in Hz, z in Bark):
(1) $z = 7 \ln\left[ (f/650) + \sqrt{(f/650)^2 + 1} \right]$
(2) $f = 650 \sinh(z/7)$
    Deviation from table: $\pm 0.13$ Bark for $f < 4000$ Hz
(3) $z = 13 \arctan(0.00076 f) + 3.5 \arctan\left( (f/7500)^2 \right)$ (no simple inverse)
    Deviation from table: $\pm 0.20$ Bark
(4) $z = 26.81 f / (1960 + f) - 0.53$
(5) $f = 1960 (z + 0.53) / (26.28 - z)$
    Deviation from table: $\pm 0.05$ Bark for 200 Hz $< f <$ 6700 Hz
Authors: (1), (2): Schroeder; (3): Zwicker and Terhardt; (4), (5): Traunmüller.
The subtractive constant in (4) is irrelevant in applications where
only distances between formants are to be known. The paper also
contains modifications of eqs. (3) and (4) to cover the whole range of
auditory perception with high accuracy. These equations are less
simple, however. It is also shown how to calculate critical
bandwidths.
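As an illustration of why an invertible expression is convenient, the following minimal Python sketch implements equations (4) and (5) next to equations (1) and (3). The function names and the helper `shift_formant` are hypothetical and chosen only for this example; they are not taken from the thesis.

```python
import math

def hz_to_bark(f_hz):
    """Equation (4): critical-band rate z (Bark) from frequency f (Hz)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z_bark):
    """Equation (5): algebraic inverse of equation (4)."""
    return 1960.0 * (z_bark + 0.53) / (26.28 - z_bark)

def hz_to_bark_schroeder(f_hz):
    """Equation (1); asinh(x) equals ln(x + sqrt(x**2 + 1))."""
    return 7.0 * math.asinh(f_hz / 650.0)

def hz_to_bark_zwicker_terhardt(f_hz):
    """Equation (3), which has no simple inverse."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def shift_formant(f_hz, delta_bark):
    """Hypothetical helper: move a frequency by delta_bark along the Bark scale."""
    return bark_to_hz(hz_to_bark(f_hz) + delta_bark)

# Compare the three approximations at a few frequencies.
for f in (200.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(f, round(hz_to_bark(f), 2),
          round(hz_to_bark_schroeder(f), 2),
          round(hz_to_bark_zwicker_terhardt(f), 2))
```

Because equation (5) inverts equation (4) exactly, shifting a formant by a fixed tonotopical distance needs only one forward and one inverse evaluation, which is the use case named in the text.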
1.2 The acoustic-to-phonetic transformation
(papers 2 and 3)
Speech sounds, as heard by listeners, contain besides the phonetically
coded verbal information also information about the speaker (e.g. age,
sex, mood) and about his location in relation to the listener. The
acoustic properties of the realizations of given phonemes vary widely
depending on the physical dimensions of the speaker's vocal tract, on
his vocal effort, and on the transmission of the signal (distance,
reverberation). Listeners are able to distinguish these phonetic,
personal, and transmittal qualities of speech sounds. This raises the
question of which decisive properties the perception of phonetic
quality is based on. Besides the mentioned factors, the acoustic
properties of given speech sounds or, seen the other way, the phonetic
quality of a sound with given acoustic properties is also dependent on
context. The present study, however, treats only intrinsic factors of
vowels.
A perceptual analysis of vowels spoken by men, women, and children,
recorded on a gramophone disk and reproduced with several different
velocities led Chiba and Kajiyama (1941) to the conclusion that "a
vowel is characterized by its relative formants, provided that the
centers of the formants are situated within certain regions fixed for
a given vowel". Assuming, as these researchers did, a linear
relationship between the logarithm of frequency and the position along
the basilar membrane, vowels with the same formant frequency ratios
would create equivalent excitation patterns. Potter and Steinberg
(1950) posed essentially the same hypothesis: "within limits, a
certain spatial pattern of stimulation along the basilar membrane may
be identified as a given sound regardless of position along the
membrane".
The displacement between the formant frequency patterns of the same
vowels spoken by men, women, and children shows indeed a certain
degree of uniformity on a tonotopical scale. The female-male
differences in F4, F3, and, except for back rounded vowels, in F2 all
come close to 1.0 Bark. F2 in back vowels and F1 in closed vowels
appear, however, to deviate from this tendency (see Table 1). This is
substantiated by fairly uniform published data on vowels from speakers
of several languages.
Vowel     F1    F2    F3    k1    k2    k3    d1    d2    d3
u        310   760  2225  1.06  1.01  1.23  0.20  0.05  1.40
o, ʊ     425   815  2375  1.07  1.05  1.17  0.25  0.30  1.05
ɔ        500   840  2470  1.11  1.06  1.13  0.45  0.35  0.80
ɒ, ɑ     670  1045  2510  1.17  1.12  1.15  0.85  0.70  0.90
a        735  1270  2480  1.25  1.15  1.15  1.25  0.90  0.90
æ        650  1670  2425  1.27  1.17  1.18  1.30  1.05  1.10
ɛ        480  1840  2455  1.19  1.18  1.20  0.80  1.10  1.20
e, ɪ     360  2045  2580  1.11  1.22  1.18  0.35  1.35  1.10
i        275  2190  2950  1.07  1.21  1.13  0.20  1.25  0.80
y        265  1835  2225  1.00  1.19  1.17  0.00  1.15  1.05
ø        375  1610  2185  1.05  1.16  1.16  0.20  1.00  1.00
œ        450  1390  2275  1.07  1.18  1.17  0.30  1.10  1.05
Table 1. Formant frequencies F, in Hz, in vowels spoken by men;
formant frequency ratios k = F_fem/F_male; and critical-band rate
differences d = z(F_fem) - z(F_male), in Bark. Mean values from
speakers of several languages (Am. English, Swedish, Serbo-Croatian,
Danish, Estonian, and Dutch), taken from a study by G. Fant (1975).
For more details see paper 3, Tables 1 to 3.
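The relation between the k and d columns of Table 1 can be sketched in a few lines of Python. The sketch assumes that d is obtained by applying equation (4) to the male mean frequency F and to the female value k·F; the thesis may have averaged differently, so small rounding differences against the tabulated values are to be expected. Only three rows of the table are copied here, and the function names are hypothetical.

```python
def hz_to_bark(f_hz):
    """Equation (4): critical-band rate z (Bark) from frequency f (Hz)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_difference(f_male_hz, k):
    """d = z(F_fem) - z(F_male), taking F_fem = k * F_male (assumption)."""
    return hz_to_bark(k * f_male_hz) - hz_to_bark(f_male_hz)

# Rows copied from Table 1: vowel, (F1, F2, F3) in Hz, (k1, k2, k3).
rows = [
    ("u", (310.0, 760.0, 2225.0), (1.06, 1.01, 1.23)),
    ("a", (735.0, 1270.0, 2480.0), (1.25, 1.15, 1.15)),
    ("y", (265.0, 1835.0, 2225.0), (1.00, 1.19, 1.17)),
]

for vowel, formants, ratios in rows:
    d_values = [bark_difference(f, k) for f, k in zip(formants, ratios)]
    print(vowel, [round(d, 2) for d in d_values])
```

The example makes the non-uniformity visible: a ratio k close to 1 at a low F1 gives a d near zero, while similar ratios at F2 and F3 of front vowels give differences near 1 Bark.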
The present investigation of the factors decisive for perceived
phonetic quality involved the following steps:
(1) The factors influencing the perceived phonetic quality of
synthetic vowels containing only one formant were mapped. The two
parameters f0 and F1 were varied over the whole range of frequencies
that can be observed for f0 and F1 in natural speech. The vowels were
identified by 23 native speakers of a Bavarian dialect in which five
degrees of vowel openness occur distinctively (cf. Table 2).
(2) It was investigated to what extent variables other than f0 and F1
in natural vowels, i.e., the higher formants, influence the perception
of features carried by f0 and F1 alone in one-formant vowels. To this
end, synthetic versions of natural vowels with F1 and/or f0
systematically displaced in frequency were generated to be identified
by those same subjects.
(3) From the results of these experiments it could be inferred that
the phonetic interpretation of the cochlear version of speech sound
spectra (excitation patterns) is based on an analysis of local
configurations no wider than roughly 6 Bark. The shape of these
configurations, which depends critically on the distances between
neighbouring formants, is more important for the phonetic quality than
the absolute position of these configurations. On this basis it was
predicted that in vowels sharing the same phonetic quality, any
distances between formants will be invariant if they are smaller than
6 Bark. This prediction was then confirmed by an analysis of the data
(Table 1) on male and female vowel productions. It can be seen in
Figure 2 that the distances between F3 and F2 and between F2 and F1
are almost the same in vowels spoken by men and by women as long as
they are smaller than 5 or 6 Bark, while most larger distances happen
to be smaller in vowels spoken by men than in those by women. The
tonality distance between F1 and the glottal "formant" Fg, just above
f0, which shows up as a peak in the spectrum of open vowels, also
appears to be invariant and should be so. The data on this matter are
not, however, highly reliable.
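The prediction in step (3) can be checked roughly against Table 1 with the sketch below. It assumes equation (4) for the Bark conversion and female formants computed as k times the male means; the function names are hypothetical. It returns the two quantities plotted in Figure 2, the distances z2 - z1 and z3 - z2.

```python
def hz_to_bark(f_hz):
    """Equation (4): critical-band rate z (Bark) from frequency f (Hz)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def formant_distances(formants_hz):
    """Return (z2 - z1, z3 - z2) in Bark for a triple (F1, F2, F3) in Hz."""
    z1, z2, z3 = (hz_to_bark(f) for f in formants_hz)
    return z2 - z1, z3 - z2

def female_formants(male_formants_hz, ratios):
    """Female formants estimated as k * F_male (assumption, see Table 1)."""
    return tuple(f * k for f, k in zip(male_formants_hz, ratios))

# [u] from Table 1: F2 - F1 is a small distance, F3 - F2 a large one.
male_u = (310.0, 760.0, 2225.0)
fem_u = female_formants(male_u, (1.06, 1.01, 1.23))
print("male  ", [round(d, 2) for d in formant_distances(male_u)])
print("female", [round(d, 2) for d in formant_distances(fem_u)])
```

For [u], the small distance between F2 and F1 differs only slightly between the male and the estimated female version, whereas the large distance between F3 and F2 grows by more than 1 Bark, in line with the pattern described for Figure 2.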
The perception of the phonetic quality of vowels can be seen as a
process of tonotopical gestalt recognition, analogous to visual
perception of form. Local features play an important role in both
cases. This analogy is further corroborated by the transposability of
formant patterns on different carriers such as buzz or noise (in
whispering). Visual form can also be carried by different underlying
structures.
[Figure 2: scatter plot of Z2 - Z1 (Bark, 0 to 10) against Z3 - Z2 (Bark, 0 to 10) for the vowels of Table 1.]
Figure 2. Tonotopical distance, expressed in Bark, between F2 and F1
plotted against that between F3 and F2 in the vowels of Table 1.
Crosses: male versions; rings: female versions. These distances are
predicted to be invariant if they are below 5 or 6 Bark.
The conclusion that local configurations are more important than the
overall shape of the excitation pattern was arrived at by several
observations:
** The phonetic quality of the male vowels [i e ɛ] was unchanged
when both F1 and f0 were moved up in frequency while their distance
was kept constant on a modified tonotopical scale that had the
property of leaving the perceived degree of openness invariant
(paper 2, exp. 2). On the basis of an overall match, these vowels
should all have been heard as rounded due to the decreased distance
between F2 and F1. Actually, this happened only when this distance
became less than 6 Bark (in the [ɛ]-based stimuli). Thus, the
distance between F2 and F1 is not a prevailing cue to roundedness if
it is larger than 6 Bark.
** When f0 was moved up in frequency in the vowels [e ɛ æ], these
were heard as progressively less open, to end up with [i] or [y]
(paper 2, exp. 4). This is in accordance with the results of the
one-formant vowel identifications. In the [æ]-based stimuli,
however, a large minority of subjects perceived no change in phonetic
identity when f0 was moved up. In this vowel, the distance between
F2 and F1 was only 5 Bark. Hence, the identification can be based
on the configuration shaped by F1, F2, and the higher formants.
** Previous experimentation had shown that one-formant vowels with
the formant located above 1.2 kHz were identified on the basis of
the position of that formant with little or no dependence on f0. In
this case, then, neither the overall shape nor any local
configuration provides any cues as to the phonetic identity of the
stimulus.
** It might be expected that one-formant vowels would be heard as
back vowels since they are dissimilar to front vowels in that they
do not have any prominent components at high frequencies, but only
one of the subjects perceived them this way. The other 22 heard
mostly front vowels, in particular front rounded ones, though also
some back vowels, mostly [u] and [o] (besides [a], for which no
front/back opposition exists in the subjects' dialect). If local
configurations are more important than the overall shape of the
spectrum, then the prevailing cue to backness will be that there is
a second formant, F2, close to F1, while in front vowels F2 is
close to F3. In one-formant vowels there is definitely no second
formant close to the first, and this appears often to prevent
people from hearing back vowels, although these stimuli lack any
kind of positive cue to frontness and are more similar to back
vowels in overall spectrum shape. Those back vowels in which F2 is
generally most prominent ([ɔ] and [ɒ]) were rarely ever heard.
It is well known that the frequency position of the first formant is
correlated with the perceptual dimension of openness in vowels.
One-formant vowels can be expected to carry at least some information
about this dimension. Figure 3 shows the identifications of
one-formant vowels. Phonemes with the same degree of openness are
collapsed. It can be seen that the prevailing criterion for perceived
openness is not F1 alone, but the tonotopical distance between F1 and
f0 (boundaries running horizontally in the figure). Only when F1 is
very high is the position of F1 alone decisive (boundary between [a]
and less open vowels at f0 > 350 Hz). At f0 = 350 Hz, an abrupt
change in response behaviour can be seen. At higher fundamental
frequencies, the distance between the first two partials is apparently
too large to allow the ear to extract the original formant. The second
partial is discerned as the first spectral peak above f0 and
apparently interpreted as F1, with f0 as Fg.
[Figure 3: response regions plotted with fundamental frequency on the horizontal axis (0.02 to 0.7 kHz, also labelled as tonality of fundamental in Bark) and tonotopical distance between formant and fundamental on the vertical axis.]
Figure 3. Identifications of one-formant vowels by subjects competent
in an Eastern Central Bavarian dialect. Horizontally: fundamental
frequency, scaled tonotopically; vertically: tonotopical distance
between formant (or partial) and fundamental; bisector of the
coordinates: tonotopical position of formant. Phonemes with the same
degree of openness (or "vowel height") collapsed. Dashed areas:
boundary regions with low conformity between subjects. First partials
also shown.
The experiments with shifted f0 and/or F1 demonstrated that the
relation between F1 and f0 is indeed the prevailing cue to perceived
degree of openness in natural, voiced vowels. The higher formants
contribute only marginally.
At fundamental frequencies below 350 Hz, the distance between F1 and
f0 is not strictly invariant in vowels with the same perceived degree
of openness. This can be seen more clearly by an analysis of vowels
spoken by men, women, and children (see paper 3, figure 9). It reveals
that in closed and half-open vowels the tonotopical distance between
F1 and f0 is smaller in vowels spoken by women than in those by men
and children. Several alternative hypotheses able to explain this
observation are discussed. At least in part, this particularity may be
due to restricted spectral selectivity, as discussed in the following
section.
1.3 Spectral resolution in timbre perception
(papers 4 and 5)
In ordinary vowels, the resonances of the vocal tract, known as
formants, are "sampled" by the partials of the glottal voice. This
leads to a more or less exact rendition of the formants, depending on
f0. In a rough approximation, the formant bandwidth B follows the
expression B = 0.05 F + 50 Hz, with F = center frequency of the
formant. For low formant frequencies, the interspace between
consecutive partials is wider than B even in low pitched voices (see
Figure 4). Despite the consequently uncertain rendition of the peaks
of lower formants, listeners are nevertheless capable of extracting a
feature (openness) closely associated with the frequency position of
F1.
On the perceptual side of this problem, the known frequency
selectivity of the ear has to be considered. A frequency band of
1 Bark comprises at any position a range of frequencies wider than
formant bandwidth B.
[Figure 4: curves a and b plotted with frequency f (0.5 to 10 kHz) on the vertical axis and fundamental frequency f0 (about 200 to 400 Hz) on the horizontal axis.]
Figure 4. Frequencies f above which there is always at least one
partial within the frequency range f ± 0.025 f ± 25 Hz (formant peaks,
curve a) or within f ± 0.5 Δf_G (one-Bark frequency bands, curve b) at
a fundamental frequency f0. In the region below curve a, the formant
peaks are deficiently reproduced and in that below curve b, single
partials can appear as peaks in a diagram of loudness density vs.
critical-band rate. The dashed region covers the range of
characteristic frequencies of vowels (f0 to F4) at normal phonation.
If two partials are closer than 1 Bark, they will not produce separate
peaks in loudness-density over critical-band rate. At sufficiently low
fundamental frequencies, peaks will appear in the auditory spectrum
only where shaped by formants. At higher f0's, the lowest partials
will shape their own peaks (see Figure 4). In the frequency range
where this occurs, particularly in the speech of women and children,
we find F1 and in back rounded vowels also F2. These formants will
then be difficult to localize and the second partial might be
interpreted as F1. In natural vowels, this might not occur, since the
higher formants make the "correct" identification of these vowels
possible. In one-formant vowels, the expected consequence can be
observed, but it appears only at f0 > 350 Hz (see Figure 3). This
could be explained on the basis of an effective bandwidth of
resolution or spectral integration close to 3 Bark.
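The two quantities discussed in this section can be made concrete with a short Python sketch. It uses the bandwidth expression B = 0.05 F + 50 Hz from the text and equation (4) for the Bark conversion. The function `curve_a_limit` is my own reading of curve a in Figure 4 (a window of total width 0.05 f + 50 Hz always contains a partial once that width reaches f0) and should be taken as an assumption, as should all function names.

```python
def hz_to_bark(f_hz):
    """Equation (4); stated accuracy holds for 200 Hz < f < 6700 Hz."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def formant_bandwidth(f_hz):
    """Rough formant bandwidth B = 0.05 F + 50 Hz from the text."""
    return 0.05 * f_hz + 50.0

def curve_a_limit(f0_hz):
    """Frequency above which B >= f0, i.e. a partial always falls within
    the formant peak (assumed reading of curve a in Figure 4)."""
    return max(0.0, 20.0 * (f0_hz - 50.0))

def first_partials_distance(f0_hz):
    """Tonotopical distance in Bark between the first two partials."""
    return hz_to_bark(2.0 * f0_hz) - hz_to_bark(f0_hz)

for f0 in (200.0, 250.0, 300.0, 350.0, 400.0):
    print(f0, round(curve_a_limit(f0)), round(first_partials_distance(f0), 2))
```

With equation (4), the first two partials are about 3 Bark apart at f0 = 350 Hz, which is the fundamental frequency at which the response behaviour in Figure 3 changes abruptly.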
Spectral integration over a range of 3 Bark could also explain the
fact that the difference in the position of F1 between male and female
productions of closed and half-open vowels is smaller than expected on
the basis of an invariant tonotopical distance between F1 and f0. In
the tonotopical representation of these vowels, the configuration
shaped by the partials up to and including F1 can be considered
decisive for perceived openness. This configuration is characterized
by a certain distance between its low-frequency flank and the peak
shaped by F1. This distance, which we would predict to be invariant,
would be independent of f0 for f0 < 150 Hz (≈ 1.5 Bark), because for
lower f0's, the low-frequency flank of the configuration would
coincide with the end of the tonotopical scale.
It has been observed previously that a group of closely spaced vowel
formants can be replaced by a single formant in simplified synthetic
vowels. The second formant in synthetic two-formant vowels stands not
only for F2, but for the whole group of higher formants (F2, F3, F4)
in closed and half-open front vowels. In Figure 3, the boundary
between [a] and less open vowels, mostly [æ] and [œ], does not match
the ordinary position of F1. Apparently, the single formant has been
matched to some mean of F1 and F2 in [a], where these formants are
closer than 3 Bark, while in [æ] and [œ], where they are more
distant from each other, it has been matched with F1 alone.
In some experiments by Chistovich et al. (1979), subjects had to match
vowel-like two-formant sounds with one-formant sounds by adjusting the
frequency position of a formant. The results showed that the single
formant was placed in the middle between the formants of the
two-formant stimulus as long as their tonotopical distance Δz remained
below a critical value Δzc of 3.0 to 3.5 Bark. The preferred position
of the single formant could be moved continuously between the two
formants by variation of their relative levels. If Δz was increased
above Δzc, this was not feasible any more, and the single formant was
placed close to either one of the two formants.
Further evidence for an effective bandwidth larger than 1 Bark was
obtained by Benedini (1978), investigating the timbre differences
between complex tones consisting of four, five, or six harmonics of
100 Hz. In a model simulation, presupposing Gaussian-shaped spreading
of each partial, Benedini arrived at σ = 1 Bark, equivalent to a
bandwidth of roughly 2 Bark.
All of these observations might be explained by one and the same, and
possibly quite peripheral, feature of the auditory system. The present
studies were intended to further illuminate this possibility.
In a perceptual experiment, subjects had to rate the similarity
between a speech-unlike two-formant noise and a tone. The tone was
equal in frequency either to one of the formants or to their
critical-band rate mean, which was at 1.6 kHz in each case. The
distance between the two formants was varied between 1 and 5 Bark in
steps of 1. In a second experiment, the noises and tones were replaced
by buzz-excited two-formant and one-formant sounds.
The essential findings were the following:
** Most subjects ignored F2. Their ratings were quite highly
correlated with the tonotopical closeness of F1 to the matched tone
or single formant.
** The perceptual saliency of F1 was much higher than that of F2,
even when F1 was attenuated (in a variation of the first
experiment).
** Subjects appeared to use the same criteria whether matching
simple tones or one-formant sounds to two-formant stimuli.
** A minority of subjects gave higher ratings to pairs in which
the single formant or tone was at the critical-band rate mean of
the formant pairs. This tendency was present up to the maximal
distance occurring in the set of stimuli (5 Bark).
These results, which are strikingly different from those obtained by
Chistovich et al. (1979) with more vowel-like stimuli, do not support
the hypothesis that a 3-Bark band of integration is fundamental to
timbre perception in general. The majority of subjects in the present
experiments apparently based their judgements on the pitch of the
lower formant vs. that of the matched tone or single formant.
A re-analysis of the timbre differences measured by Benedini led to
the following findings:
** The perceptual weight of single partials, equal in level, was
in that experiment roughly proportional to their frequency (with
f ≤ 600 Hz), except for the lowest partial present in the sound.
** The perceptual weight of the lowest partial appeared to be
substantially increased. The timbre differences created by removal
of that partial probably constitute a dimension (full vs. residual
tones) different from that descriptive of the other timbre
differences present in that experiment.
** The bandwidth of spectral resolution was also proportional to
f. The resolution function can be described by a resonance of the
type $a = 1 / (1 + \Omega^2/n)^{n/2}$, where a = attenuation relative
to the peak and $\Omega = v/d$, with $v = f_2/f_1 - f_1/f_2$, and
d = damping coefficient. The best fit to the data was obtained with
n = 2 and d = 0.316 (Q = 3.16).
** The perceived magnitude of timbre differences was intricately
dependent on context of presentation.
The other experimental results suggesting a bandwidth of roughly
3 Bark would be compatible with the assumption that d is constant over
the whole range of auditory perception. A d = 0.316 is equivalent to a
bandwidth B ≈ 2.0 Bark in the frequency range 0.8 kHz < f < 4.8 kHz.
The higher value of Δzc obtained in formant matching experiments could
be accounted for as being due to the addition of the intrinsic B of
the formants involved. At lower frequencies, however, the bandwidth of
resolution evidenced here is clearly too narrow to solve the problem
illustrated in Figure 4.
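A minimal Python sketch of the resolution function may help in relating d to a bandwidth in Bark. It assumes the form a = 1/(1 + Ω²/n)^(n/2) with Ω = v/d and v = f2/f1 - f1/f2, which is my reading of the expression given above. It further assumes that a is an amplitude ratio and that the bandwidth is measured between the points where a has dropped to 1/√2. Neither assumption is confirmed by the summary, so the resulting figures are indicative only. Equation (4) is used for the Bark conversion, and all function names are hypothetical.

```python
import math

def hz_to_bark(f_hz):
    """Equation (4): critical-band rate z (Bark) from frequency f (Hz)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def attenuation(f1_hz, f2_hz, d=0.316, n=2):
    """Assumed resolution function a = 1 / (1 + omega^2/n)^(n/2)."""
    v = f2_hz / f1_hz - f1_hz / f2_hz
    omega = v / d
    return 1.0 / (1.0 + omega ** 2 / n) ** (n / 2.0)

def half_power_ratio(d=0.316, n=2):
    """Frequency ratio r = f2/f1 at which a = 1/sqrt(2) (assumed criterion)."""
    omega = math.sqrt(n * (2.0 ** (1.0 / n) - 1.0))
    v = omega * d
    # Solve r - 1/r = v for the root r > 1.
    return (v + math.sqrt(v * v + 4.0)) / 2.0

def bandwidth_bark(f_hz, d=0.316, n=2):
    """Width in Bark between the two assumed half-power points around f."""
    r = half_power_ratio(d, n)
    return hz_to_bark(f_hz * r) - hz_to_bark(f_hz / r)

for f in (800.0, 2000.0, 4800.0):
    print(f, round(bandwidth_bark(f), 2))
```

Under these assumptions the bandwidth comes out somewhat below 2 Bark over the range 0.8 to 4.8 kHz, the same order of magnitude as the value quoted in the text; a different bandwidth criterion would shift the figure.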
An analysis of the perceived timbre differences between complex tones,
low pass limited at frequencies fh between 130 and 1720 Hz, also
measured by Benedini (1978), showed that these differences were highly
correlated with the tonotopical distances between these limiting
frequencies. If the spectral resolution obtained from the preceding
experiments is taken into account by replacing fh with 1.4 fh, a still
higher correlation is obtained (rank order correlation coefficient
rs = 0.993). However, if the frequency dependence of the perceptual
weight of the partials is also taken into account, this correlation
declines. This indicates that the listeners recognized that the
stimuli differed only in fh, and consequently based their judgements
exclusively on the distances between the upper flanks of the compared
harmonic tones.
Consequently, it is concluded that the judgement of timbre differences
involves the extraction of certain dimensions or features induced by
context. It is suggested that the limited resolution that appears in
experiments involving vowel quality may be due to a limited resolution
intrinsic to the phonetic templates supposedly stored in memory.
1.4 Acoustical consequences of vocal tract growth
(paper 3)
The speaker category differences in the characteristic frequencies of
vowels are, obviously, a consequence of physiological facts. The
physical properties of the glottal structures determine the range of
fundamental frequencies that can be used comfortably, and the
dimensions of the vocal tract confine the range of possible formant
frequency variation. Most of the differences in vocal tract shape
between men and women are due to the physiological changes affecting
boys during puberty. We may, then, ask whether these changes demand a
modification of the speaker's articulatory habits in order to keep the
phonetic quality of vowels the same as before, or whether, inversely,
the differences in the acoustic data can be understood as a result of
unchanged articulatory habits in spite of the physiological changes.
This last hypothesis does not apply to the earlier development in
children whose articulatory habits are not yet rigidly and precisely
established.
To a first order of approximation, the age- and sex-dependent
differences in vowel formant frequencies can be understood as a
consequence of a proportional re-scaling of all three dimensions of
the vocal tract. This would leave the formant frequency ratios
invariant. All the k-values in Table 1 would then be the same. By
virtue of the observed systematic deviations from the mean k-value, we
conclude that for the same vowels of different speaker categories, the
vocal tracts are not only different in size, but not even proportional
in shape.
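The first-order argument can be illustrated with the textbook model of a uniform tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1) c / (4 L). This model, the speed of sound, and the two tract lengths below are illustrative assumptions and are not taken from the thesis, which used measured area functions.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed round value for warm, moist air

def uniform_tube_formants(length_cm, count=3):
    """Quarter-wavelength resonances of a uniform tube, in Hz."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for n in range(1, count + 1)]

def k_values(length_male_cm, length_female_cm, count=3):
    """Female/male formant ratios for two proportionally scaled tubes."""
    male = uniform_tube_formants(length_male_cm, count)
    female = uniform_tube_formants(length_female_cm, count)
    return [f / m for f, m in zip(female, male)]

# Hypothetical tract lengths: 17.5 cm (male) and 15.0 cm (female).
print(uniform_tube_formants(17.5))
print([round(k, 3) for k in k_values(17.5, 15.0)])
```

Under proportional scaling every formant is raised by the same factor, so k1 = k2 = k3. The vowel-dependent spread of the k-values in Table 1 is what rules this simple picture out.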
The physiological changes affecting boys during puberty include an
elongation of the vocal folds, an increase in the cross-sectional area
of the pharyngeal tube, and an elongation of that tube due to the
descent of the larynx. Further, the back of the tongue is pulled down
to some extent as a consequence of larynx descent. Ceteris paribus,
the cross-sectional areas of the vocal tract will then increase also
in the palato-velar region. In terms of percentage change, this
increase will be largest for closed (high) vowels. This effect has not
been taken into account in previous studies treating this topic.
The acoustical consequences of these changes were estimated by means
of model calculations using a computer program simulating an electrical
line analog of the vocal tract. The area functions were taken from
data by B. Lindblom and J. Sundberg on a male speaker. They were
changed slightly in order to yield approximately the formant frequencies
in Table 1. The vocal tract shapes were consequently perturbed in
a way rather crudely cancelling the changes occurring during puberty.
It was further assumed that these changes account for most of the
differences between men and women. The results of the calculations are
shown in Figure 5 together with the observed female/male k-values. It
can be seen that, for all three formants, the calculated k-values
reproduce the observed tendencies to some degree. We may, then,
conclude that the changes occurring during puberty in the mean case
do not require an active modification of articulatory gestures to
any large extent. Only with regard to labial opening in rounded
vowels, do the present calculations not provide an unequivocal
answer. By and large, the normal combinations of vocal fold length,
vertical larynx position, pharyngeal cross-sectional areas and
overall openness of the vocal tract appear to leave the phonetic
quality of speech sounds invariant. If the physiological changes
were not in harmony with the acoustic-to-phonetic transformation, we
would expect a predisposition for a steady vowel shift, uniform in
all languages.
[Figure 5: three panels showing k3, k2 and k1 (vertical axes, roughly
0.9 to 1.3), where kn = Fn(female) / Fn(male), for each vowel of
Table 1 (horizontal axis).]
Figure 5. Female-male formant frequency ratios in
the vowels of Table 1 shown by diamonds. Connected
rings show result of present calculations simulating
female-male anatomical differences.
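
The kind of model calculation described above can be sketched as follows. This is a minimal lossless concatenated-tube (transmission-line) model with an ideal open end at the lips, not the program used in the thesis. The area functions, section length, speed of sound and air density are hypothetical values chosen for illustration; they are not the Lindblom and Sundberg data.

```python
import math

C_SOUND = 350.0  # m/s, assumed speed of sound
RHO = 1.14       # kg/m^3, assumed air density


def chain_d(freq_hz, areas_cm2, section_len_m):
    """Element M22 of the glottis-to-lips chain matrix (real if lossless).

    The matrix is stored as (a, b, c, d), meaning [[a, j*b], [j*c, d]].
    """
    k = 2.0 * math.pi * freq_hz / C_SOUND
    cs = math.cos(k * section_len_m)
    sn = math.sin(k * section_len_m)
    a, b, c, d = 1.0, 0.0, 0.0, 1.0
    for area in areas_cm2:  # sections ordered from glottis to lips
        z = RHO * C_SOUND / (area * 1e-4)  # characteristic impedance
        a, b, c, d = (a * cs - b * sn / z,
                      a * z * sn + b * cs,
                      c * cs + d * sn / z,
                      d * cs - c * z * sn)
    return d


def formants(areas_cm2, section_len_m, n=3, f_max=6000.0, step=10.0):
    """Find the first n zeros of M22 (resonances with zero lip pressure)."""
    def d_at(f):
        return chain_d(f, areas_cm2, section_len_m)

    found = []
    f_lo, d_lo = step, d_at(step)
    while len(found) < n and f_lo < f_max:
        f_hi = f_lo + step
        d_hi = d_at(f_hi)
        if d_lo != 0.0 and d_lo * d_hi <= 0.0:
            lo, hi = f_lo, f_hi
            for _ in range(40):  # bisection refinement
                mid = 0.5 * (lo + hi)
                if d_at(lo) * d_at(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            found.append(0.5 * (lo + hi))
        f_lo, d_lo = f_hi, d_hi
    return found


# Sanity check: a uniform 17.5 cm tube (35 sections of 0.5 cm).
uniform_f = formants([5.0] * 35, 0.005)

# Hypothetical open-vowel-like shapes: narrow pharynx, wide mouth cavity.
# The "female" tract is shortened mostly in the pharynx (non-proportional).
male_f = formants([1.0] * 18 + [7.0] * 17, 0.005)
female_f = formants([1.0] * 14 + [7.0] * 16, 0.005)
k = [ff / fm for ff, fm in zip(female_f, male_f)]
print([round(f) for f in uniform_f])
print([round(x, 3) for x in k])
```

With a non-proportional change such as this one, the three k-values generally differ from each other, which is the qualitative behaviour the calculations in Figure 5 are meant to capture. A realistic simulation would also need losses, lip radiation and measured area functions.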
2 VOCALISM IN EASTERN CENTRAL BAVARIAN
(paper 6)
The perceptual experiments in which the subjects had to identify
synthetic vowels were performed with speakers of the Eastern Central
Bavarian dialect (Ostmittelbairisch) of Amstetten, Austria. Speakers
of that dialect were chosen because of its large number of different
vowels and finely shaded distinctions of vowel openness (or vowel
"height") which might contribute to a high reliability and "resolution"
in the subjects' responses. The vowels are shown in Table 2.
+nasal T e ..... -- ...... u y F [ tt:] [ 01] ee Je CI! a D 0 oe
-nasal e E CI! a D J 0 u Y (J CI! cr JI 01 ul ie ee )e ue ye tEe
palatal + + + + - + + + + - + + + +
rounded - - - - - + + + + + + + + + + + + + + +
openness 1 2 3 4 5 4 3 2 1 1 2 3 4 3 2 1 1 3 3 1 1 3
Table 2. Feature analysis of the non-reduced vowels in the Eastern Central Bavarian dialect
of Amstetten, Austria. In stressed syllables, these vowels are short in duration
when followed by a fortis consonant, otherwise they are long when stressed. The
nasalized vowels are phonemic only when long. Those in brackets occur as allophones
only. In diphthongs, the figure for openness refers to the initial segment.
In the Bavarian dialect region and most typically in Eastern Central
Bavarian, spoken in Austria north of the Alps, several socially stratified
forms of language are in use. There is a continuum of speech
forms between the rural dialect (or urban jargon) and the regional
form of standard German. The use of a certain form of speech is not
rigidly linked to the social status of the speakers. The speakers also
choose their form of speech depending on several other factors such as
topic and environment of the discourse. Loans from higher or lower
ranked sociolects fulfill the function of expressing respect or disdain
towards somebody or something. There is also a particular relationship
between speech tempo and sociolect. The dialectal word forms
are more reduced than the corresponding forms of standard German.
Therefore, the dialect is preferred in lively speech, while a speech
form closer to the standard is preferred in a more deliberate mode of
speech. This kind of "dynamic diglossia" is quite different from the
static type of diglossia observable in Switzerland where the standard
is used only in literary and ceremonial contexts, or in northern
Germany, where the dialect largely has been superseded by the standard.
The vowel systems of seven dialectal varieties are presented. As could
be expected, the dialect of Vienna, the capital of the region, is
closest to the regional dialectal koine, a form of speech ranked
intermediately on the social scale. The dialectal peculiarities increase
with increasing communicational distance from the capital. In
general, for all their speech forms, speakers stay within the frame of
a uniform vowel system common to all the sociolects used in a given
region (we exclude here the small fraction of natives with an active
proficiency in the "Hochlautung"), though as for the precise phonetic
quality of vowels, some differences between age groups can be observed.
Among the phonological rules, those concerning the vocalization of
/l/, /r/, and /n/ have particularly profound consequences on the
phonetic make-up of these dialects. While these consonants cause some
feature(s) to be added to preceding vowels, the conditioning consonants
themselves are consequently deleted. The four rounded front
vowels [y ø œ ɶ] and the diphthongs [ui oi ɔi] arise by /l/-vocalization. None of these vowels occurs in underlying dialectal
forms. The diphthongs [iɐ ɛɐ ɔɐ uɐ yɐ œɐ] (there are more of them in
some dialectal varieties) are the products of /r/-vocalization. Although
some of these diachronically in some instances have a different
origin and no /r/ in the standard German equivalents, this can no
longer be derived by a synchronic analysis of any one of those dialects
alone. The vocalization of /n/ is a more widespread process. It
produces all the nasalized vowels (see Table 2). It is, however,
peculiar to the present dialects, except that of Vienna, that
nasality is also distinctive in vowels preceding nasals. As compared
with the oral vowels, the number of distinctive degrees of openness is
reduced by one in the nasal ones. This is explained as being due to
the nasal antiformant in the region of F1. The disturbance caused by
this antiformant reduces the number of degrees of openness that can be
distinguished auditively.
The application of several rules is dependent on speech tempo. These
rules concern vowel reduction and the loss of certain segments. Another
case is monophthongization, which produces e.g. [æ], [ɒ], and
[ɶ] from underlying /aɪ/, /aʊ/, and /al/ via [ɔy]. In lively speech,
this monophthongization can be observed in the whole dialectal area.
In deliberate speech, it characterizes the dialects of Vienna and its
wider surroundings. This is an innovation that has been traced to the
speakers of the Viennese jargon at the end of the 19th century. It is
probably true that the speakers of any urban jargon are inclined to
exaggerate the particularities of the local dialect. One means to this
end is the generalization of rules which ordinarily apply to lively
speech alone. This kind of process may also explain certain delayed
substratum effects observed in historical documents.
The characteristic frequencies (fundamental and four formants) of the
oral monophthongs produced by 12 male speakers of the dialect of
Amstetten were measured. The result, see Figure 6, showed some degree
of overlap in the pairs [e ɛ], [ø œ], and [o ɔ]. This overlap was
confirmed by an auditory analysis resulting in dubious categorizations
in 26%, 13%, and 6%, respectively, of these pairs. None of the
speakers evidenced any difficulty in perceptually discriminating
between the two degrees of openness distinguishing these vowels. The
general merger of /e/ with /ɛ/ and of /ø/ with /œ/ is otherwise
characteristic of the dialect of Vienna only.
[Figure 6: vowel diagram. Horizontal axis: F2 [kHz], from 2.5 to 0.0;
vertical axis: F1 [kHz], from 0.2 to 0.8.]
Figure 6. F1/F2 diagram of vowels by male speakers of
the Eastern Central Bavarian dialect of Amstetten. Tonotopically scaled coordinates. Mean values (rings)
and standard deviations (bars) of the formant frequencies.
Vowels with the same degree of openness connected
by dashed lines.
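
As a rough illustration of how formant data can be summarized on a tonotopic scale, the sketch below converts frequencies to Bark and computes means and standard deviations. The conversion is a widely used analytical approximation of the Bark scale, z = 26.81 f / (1960 + f) - 0.53, which may differ from the conversion actually used for Figure 6. The F1 values are hypothetical, and the overlap criterion (intervals of mean plus or minus one standard deviation) is an assumption made for this example, not the criterion of the thesis.

```python
import statistics


def hz_to_bark(f_hz):
    """Approximate critical-band rate in Bark (assumed conversion)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def mean_and_sd_bark(freqs_hz):
    """Mean and sample standard deviation of frequencies, in Bark."""
    z = [hz_to_bark(f) for f in freqs_hz]
    return statistics.mean(z), statistics.stdev(z)


# Hypothetical F1 measurements (Hz) for two adjacent degrees of openness.
f1_closer = [360.0, 380.0, 400.0, 390.0]
f1_opener = [410.0, 450.0, 470.0, 430.0]

m1, s1 = mean_and_sd_bark(f1_closer)
m2, s2 = mean_and_sd_bark(f1_opener)

# Assumed criterion: the mean +/- 1 SD intervals of the two vowels overlap.
overlap = abs(m2 - m1) < (s1 + s2)
print(round(m1, 2), round(s1, 2), round(m2, 2), round(s2, 2), overlap)
```

Working in Bark rather than Hz makes distances along the F1 and F2 axes comparable in perceptual terms, which is the point of the tonotopically scaled coordinates in the diagram.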
For references see the particular papers.