IPO ANNUAL PROGRESS REPORT
Nr.12 1977
Editor: A.J. Breimer
Typist: Jeanneke van Esch
INSTITUTE FOR PERCEPTION RESEARCH - INSTITUUT VOOR PERCEPTIE ONDERZOEK
P.O. BOX 513 EINDHOVEN HOLLAND
TELEPHONE NATIONAL (040) 756605 / TELEPHONE INTERNATIONAL +3140 756605
Contents                                                                  page

Contents                                                                     2
Introduction                                                                 5
Research programme                                                           6
Organisation IPO                                                             8

Auditory perception
B. Leshowitz
    Speech intelligibility in noise for listeners with sensorineural
    hearing damage.                                                         11
H. Duifhuis, J. Smits, J. v.d. Vorst and M. Scheffers
    Further psychophysical data on two-tone suppression.                    24
J. Thomassen
    Preliminary experiments on accent perception in tone sequences.         29
B.L. Cardozo and K.G. van der Veen
    Estimation of annoyance due to low-level sound.                         34

Speech
H.F. Muller, S.G. Nooteboom and L.F. Willems
    An experimental system for man-machine communication by means of
    speech.                                                                 41
L.L.M. Vogten and L.F. Willems
    The Formator: a speech analysis-synthesis system based on formant
    extraction from linear prediction coefficients.                         47
J. 't Hart
    Pitch contour stylisation on a high-quality analysis-resynthesis
    system.                                                                 55
A.F.V. van Katwijk
    Auditory feedback as a factor in disrupted speech production.           58
D. Bouwhuis and J. de Rooij
    Vowel length and the perception of prosodic boundaries.                 63
R. Collier
    The perception of English intonation by Dutch and English listeners.    69
S.M. Marcus
    The IPO speech squeezing system.                                        74

Visual perception
F.J.J. Blommaert
    Spatial processing of small visual stimuli.                             81
H. Bouma, Ch.P. Legein and A.L.M. van Rens
    Visual recognition by dyslectic children: response latencies for
    letters and words.                                                      87
U.O. Schröder
    Backward masking in a reading-like situation.                           92
H. Timmers
    Letter cancellation in words and nonwords.                              96
H. Bouma, D.G. Bouwhuis and H. Timmers
    Processing of visible language: a Symposium.                           100

Cognition
H.C. Bunt
    Towards an analysis of dialogue organization principles.               105
Knowledge of Dutch three-letter words. (Rectification)                     115

Ergonomics and perceptual aids
H. Bouma and F.L. van Nes
    Legibility of rectilinear digits.                                      117
P.H. van der Heijden, H. Bouma, H.E.M. Melotte and F. Meyer
    A typewriter for a motorically handicapped person, operated by
    head movements.                                                        124
H.E.M. Melotte
    The IPO relief-drawing set.                                            130

Instrumentation
L.F. Willems, G. Moonen, C. Lammers, J. Dobek, A. van Nes and
H. Jimenez Nichols
    Digital equipment: a number of examples.                               132
U.O. Schröder
    A controlled voice switch.                                             137

Publications                                                               141
This report or any part thereof may not be reproduced in any form without the written
permission of the Institute for Perception Research. Illustrations may be reproduced
only with explicit mentioning of source; copies will be appreciated.
Introduction
The I.P.O. Annual Progress Report for 1977 greets its friends and colleagues and
presents them with a summary of the Institute's research in human perception and
information-handling in a technological era.
In November 1977, Dr. J.H. Bannier retired from the Supervisory Board. From its
foundation in 1957, the Institute has benefited from his wisdom as a delegate of
Z.W.O., the Netherlands Organisation for the Advancement of Pure Research. We express
our gratitude for his contributions to our work, which have helped to establish and
maintain our links with related research groups. Dr. Bannier has been succeeded by
Mr. J. Smits, member of the Z.W.O. staff.
Thanks are due to all I.P.O. members who left us in 1977. Among them, Mr. J.Chr.
Valbracht retired; over the years he has provided us with many pieces of
electronic equipment.
The serious lack of accommodation was somewhat eased when a new wing in a neighbouring
building was put into use in August 1977. The new corridor between the two buildings
has changed the outward appearance of the whole.
Trends in our fields of interest are reflected both in the contributions to the
present issue and in the research programme, listed on pages 6-7.
Dr. Barry H. Leshowitz, during his sabbatical, helped us step up our research on
speech communication by the hard-of-hearing in noisy environments. Work has also
been intensified on the interaction between man and machine when using speech in
both directions and on perceptual properties of visual displays.
In the last-named field, the Institute organised a symposium on "Processing of
Visible Language", from 5-8 September 1977. Research psychologists, graphic designers
and display engineers assembled to exchange views on the links between increased
understanding on the part of fundamental research and the ways of reading imposed
by present technology. A brief reflection on the symposium will be found on page 100.
As before, we shall be happy to maintain and extend contacts and cooperation with
our colleagues concerned with auditory and visual communication.
H. Bouma
Research Programme 1977/1978
IPO research is generally directed towards understanding intake and processing of
information by humans, in particular when they make use of technological means.
Optimal design and usage of such technology are increasingly dependent on such insights.
Hearing
cochlear transduction Nonlinear processes in the cochlea are being studied
quantitatively by psycho-acoustic experiments such as masking for both
normal and impaired hearing. In the latter case, certain deviating types of
masking have been found which have implications for hearing aids.
sound control Perceptual evaluation of pleasant sounds, such as music, gives
rise to certain transmission and recording requirements. For unpleasant
sounds ('noise') it leads to requirements for loudness and other sound
qualities.
musical accents Properties of tone sequences leading to musical accents
are being studied.
Speech
connected speech Consequences of the available description of Dutch intonation
patterns for pronunciation teaching are being considered. An attempt has also
been made to develop a similar description of the intonation of British English.
In a new approach the communicative value of pitch accents will be considered.
The relevance of time and frequency properties of speech to speech communication
is being studied.
word recognition A description of the acoustic attributes used in the
recognition of spoken Dutch words will be attempted.
speech processing Various facilities for analysis, synthesis and editing
of speech are available. The rapidly expanding technology for hardware and
software processing has made us step up our effort in this field, also as
regards perceptual consequences.
Vision
luminance contrasts We aim at a quantitative understanding of interactions
of stimuli close together in time and space. In the time domain, pulse,
step and frequency responses are considered; in the space domain, point-spread
and edge-spread functions. Combinations of temporal and spatial
changes in the visual field are also being given attention.
image quality Guided by basic transfer functions of the visual
system, physical parameters of electronic displays are being considered
in order to understand and improve display quality.
visual selection We are concentrating on visual conspicuity, defined as
properties of stimulus and surround and the strategy of visual search
connected with it.
Reading
Reading processes are being investigated both in optimally presented text
and electronic text displays where factors such as size and contrast are
often not optimal. For dyslectic children, we study the coordination of
information from the two eyes.
Cognition and Communication
word recognition In recognition processes, perceptual analysis combines
with available knowledge. We are studying availability of words outside
context for incorporation in a quantitative theory of the visual recognition
of short words.
informational dialogues To obtain well-defined information from a rich
source, short series of questions and answers may be used. The structure
of such dialogues is being considered in terms of semantic notions. We
also study practical dialogues for communication between users and
information automata.
Ergonomics
Research is directed at anticipating ergonomic consequences of the appli
cation of new technologies such as in man-computer interaction. The develop
ment of certain new types of industrial product is being supported.
Aids for the handicapped
Certain new communication aids for people with visual, auditory or motor
handicaps are being initiated, developed and tested. Using existing production
and distribution channels, we work towards the goal of bringing
aids of proved usefulness within the reach of all who need them.
Organisation IPO
Supervisory board (31.12.1977)
    Prof. Dr. C.E. Mulders (chairman)
    Prof. Dr. W.A.T. Meuwese
    Prof. Dr. J.F. Schouten
    Drs. J. Smits
    Dr. Ir. K. Teer

Scientific board (31.12.1977)
    Prof. Dr. H.B.G. Casimir (chairman)  - Heeze
    Prof. Ir. R.G. Boiten                - Delft
    Prof. Dr. Ir. P. Eijkhoff            - Eindhoven
    Prof. Dr. J.P. van de Geer           - Leiden
    Prof. Dr. H.E. Henkes                - Rotterdam
    Prof. Dr. L.F.W. de Klerk            - Tilburg
    Prof. Dr. S.L. Kwee                  - Eindhoven
    Prof. Dr. W.J.M. Levelt              - Nijmegen
    Prof. Dr. Ir. R. Plomp               - Soesterberg
    Prof. Ir. O. Rademaker               - Eindhoven
    Prof. Dr. R.J. Ritsma                - Groningen
    Prof. Dr. H. Schultink               - Utrecht
    Prof. Dr. Ir. H. Spekreijse          - Amsterdam
    Prof. Dr. P.C. Veenstra              - Eindhoven
    Prof. Dr. C.J.D.M. Verhagen          - Delft
    Dr. Ir. P.L. Walraven                - Soesterberg
    Dr. P.A. van Wely                    - Eindhoven
    Prof. Dr. P.J. Willems               - Tilburg
    Prof. Dr. Ir. A. van Wijngaarden     - Amsterdam

Director            Dr. H. Bouma
Deputy director     Drs. B.L. Cardozo
Adviser             Prof. Dr. A. Cohen   - Utrecht

Group leaders
    Dr. Ir. H. Duifhuis and Dr. S.G. Nooteboom  - Hearing and Speech
    Dr. Ir. J.A.J. Roufs                        - Vision and Reading
    Drs. D.G. Bouwhuis                          - Cognition and Communication
    Ing. F.F. Leopold                           - Ergonomics
    H.E.M. Melotte                              - Communication Aids for the Handicapped
    Ir. L.F. Willems                            - Instrumentation
Research associates
    Ing. H.J. Bleileven
    Ir. F.J.J. Blommaert
    P.M. Boers
    Ir. A.J. Breimer
    Ir. J.P.L. Brokx (Z.W.O.*)
    Drs. H.C. Bunt
    Dr. R. Collier (part-time)
    J. 't Hart
    Ing. Th.A. de Jong
    Dr. A.F.V. van Katwijk
    Dr. Ch.P. Legein (part-time)
    Prof. B.H. Leshowitz+ (for 1 year from Arizona State University)
    Dr. S.M. Marcus
    Ing. G.J.J. Moonen
    H.F. Muller
    Dr. Ir. F.L. van Nes
    Drs. J.R. de Pijper (Z.W.O.*)
    Drs. J.J. de Rooij+ (Z.W.O.*)
    Ir. U.O. Schröder (Z.W.O.*)
    Drs. J.M.E.W. Thomassen (Z.W.O.*)
    Drs. H. Timmers
    Ing. J.C. Valbracht+
    Ir. L.L.M. Vogten

Research staff
    Ing. E. de Braal
    Ing. J.J.G.M. Dobek
    G.J.N. Doodeman
    Ing. J.C. Jacobs
    C.A. Lammers
    A.W.J.J. Melchers
    A.C. van Nes
    Ing. J.A. Pellegrino van Stuyvenberg
    Ing. J. Polstra
    A.L.M. van Rens+
    K.G. van der Veen
    Ing. P. Ytsma
    H.W. Zelle

Secretaries
    Ms. M.A. Boerrigter
    Mrs. J.A.C.E. van Esch-van der Vleuten
    Mrs. C.J. Mennen-Senkeldam
    Mrs. C.E.A.L. Nuys-van de Water

Library
    Ms. R.M. Smith

Workshop
    C.G. Basten
    J.H. Bolkestein
    P.A.N. Broekmans
    A.L.M. de Cocq
    C.Th.P. Godschalx+
    D.J. van der Wees

+ Left during 1977
* Netherlands Organization for the Advancement of Pure Research
Auditory perception
Speech intelligibility in noise for listeners with sensorineural hearing damage
B. Leshowitz
Introduction
The great majority of hearing-impaired listeners suffer from physiological deficits
present at the sensory cell level in the cochlea. These disorders are often termed
"sensorineural" and cannot be medically treated either by surgery or drug therapy.
The only assistance available to listeners with sensorineural hearing damage is
amplification of the sound afforded by the personal hearing aid. Unfortunately,
efforts to ameliorate hearing disorders with amplification have not been entirely
successful. Indeed, almost every listener with sensorineural hearing loss reports
that, under many everyday conditions of background noise, speech reception is not
improved by the hearing aid.
The foregoing does not contradict the audiological observation that the hearing aid
often performs a useful function. Under the restricted condition of near total quiet,
amplification of the sound, consisting mainly of the desired speech signal, above
the threshold of audibility does allow good speech reception. How much of the listener's
acoustical day consists of substantial quiet is, of course, the fundamental issue.
Examination of the noisy communication situation experienced by the hearing-impaired
listener reveals a straightforward explanation of the widespread dissatisfaction
with the personal hearing aid. It is well-known that a concomitant of sensorineural
hearing loss is an increase in the ratio of speech level to noise level required
for just intelligible speech over that measured for the normal-hearing listener.
Measured in terms of speech-to-noise ratio, the hearing deficit may be as large as
10 to 15 dB (Plomp, 1978). When it is realized that in many common moderately noisy
communication situations, the normal listener must process speech at or near the
threshold of intelligibility, one can begin to understand the magnitude of the com
munication problem faced by the listener with impaired hearing. The personal hearing
aid cannot possibly improve the quality of partially masked speech since it provides
indiscriminate amplification of both speech and noise. Thus, while the listener with
a hearing deficit may report the presence of a speech signal, unless the speech-to
noise ratio exceeds some critical value, the listener will perceive the speech as
muffled and unintelligible. Effectively he will be deaf in all but the most ideal
listening conditions.
An auditory disability is generally assumed to be present when the listener cannot
engage in tete-a-tete conversation in quiet. Without minimizing this hearing activity,
a cursory analysis of the typical listener's acoustical day reveals that the majority
of speech communication takes place in ambient noise. Assessment of hearing capacities
from pure-tone threshold and speech intelligibility measurements made in quiet can
therefore hardly be expected to capture the everyday acoustical experiences of the
hearing-impaired listener.
Implicit in our emphasis on measurement of speech reception under realistic conditions
of moderate background noise is the assumption that the standard evaluation of hearing
in terms of the audiogram provides an inaccurate description of the listener's audi
tory capacities. On the assumption that threshold elevation in the speech region
measured in the quiet is the best predictor of speech reception, it is common audio
logical practice to evaluate the hearing impairment by averaging the pure-tone thres
holds at 500, 1000 and 2000 Hz. While there is little doubt that this index of hearing
loss does indeed predict the speech reception threshold in quiet, there is much less
justification for its continued application under the less ideal conditions of mo
derate background noise. That listeners with normal low-frequency hearing and se
lective high-frequency hearing loss often experience great difficulties in perceiving
speech in noise has frequently been reported in the anecdotal observations of practi
cing audiologists (Courtois, 1975). Specifically, listeners with an abrupt high
frequency loss due to noise trauma as well as the presbyacusic patient characterized
by a more gradually sloping audiogram are the two major categories of the hearing
impaired population thought to be especially vulnerable to noise.
The major aim of the present investigation was to determine whether there is a well
defined class of listeners incapacitated by their inability to understand speech
in noise to an extent far in excess of what would have been predicted from inspection
of the audiogram.
Positive findings would, it was felt, provide strong evidence that hearing handicap
is far more prevalent in the general population than has heretofore been recognized.
Having established the existence of an appreciable population of noise-sensitive
listeners, efforts could then be directed to developing a psychoacoustical framework
for understanding the speech communication handicap.
In the audiological literature the claim is often made that there is an increase in
the upward spread of masking for listeners with sensorineural hearing deficits. Thus,
we speculated that increase in the speech-to-noise ratio observed for listeners with
selective high-frequency hearing loss ought to be accompanied by a marked increase
in masking above that measured for normal-hearing listeners. Moreover, in view of
the speech communication handicap reported for these listeners, the enhanced masking
effect ought to take place in the low-frequency region of speech where pure-tone
thresholds are normal. A positive relationship between the pure-tone masking pattern
and speech reception in noise would, we reasoned, not only have obvious practical
implications for audiological procedures for assessing the effective hearing handi
cap, but would also constitute an intriguing research problem for the physiological
acoustician interested in basic hearing mechanisms. Although the physiological
correlates of behavioural thresholds are not well established, it is not unreasonable
to infer minimal sensory-cell loss in regions of normal hearing. Accepting this
supposition, then what underlying processes are responsible for the abnormal supra
threshold effects observed in seemingly normal regions of hearing?
Experiments
The primary strategy of the research was to assess intelligibility of speech presented
against a noise background for listeners with selective high-frequency hearing loss
and near-normal hearing in the speech frequencies. A second goal was to relate speech
reception in noise to the pattern of pure tone masking for individual listeners.
Subjects
Listeners with predominantly sensorineural hearing damage served as the experimental
group. When the air-bone gap exceeded 20 dB, the conductive loss was assumed to be
sufficiently great to eliminate the subject from participation in the experiment.
The pattern of hearing revealed by the audiogram in conjunction with the listener's
case history provided the basis for classifying the experimental subjects according
to etiology of the hearing loss. Listeners categorized as having "noise trauma (T)",
for example, showed a precipitous loss in the high frequencies and reported that they
had been exposed to appreciable levels of environmental noise. Presbyacusic subjects
(P) had a more gradually sloping audiogram and had no history of noise exposure. Two
female subjects had received extensive medical treatment with a mycine drug, and
were classified accordingly (D).
The control group (N) consisted of listeners in the under-40 age group with normal
audiograms. A second group (over-40) produced largely similar data, often with a
slight trend towards typical presbyacusic results; those results are not presented in this
paper.
Typical audiograms are shown in the results section.
Procedure
Speech intelligibility was measured for each listener using connected discourse as
the signal. The task of the subject was to adjust the level of the interfering back
ground (foyer noise) until the speech signal was "just intelligible". The listener
was informed that it was not necessary to recognize every word of the text, but only
to maintain the intelligibility of the story-line. A Bekesy up-and-down psychophysical
method was used for this purpose, wherein the subject continuously varied the level
of the background noise using an attenuator having a range of 40 dB and steps of 2 dB.
The level of the speech constituted the experimental variable and was varied between
the listener's speech reception threshold in quiet and 90 dB(A). (For a more complete
explanation of the application of the adjustment procedure to measurement of speech
intelligibility, the reader should consult the recent paper of Plomp (1976).)
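As an illustration of this adjustment logic, a minimal simulation might run as follows (a sketch only, not the experimental software; the starting noise level, the number of reversals tracked and the listener model is_intelligible are illustrative assumptions):

    def adjust_noise_level(speech_level_dB, is_intelligible,
                           start_noise_dB=40.0, step_dB=2.0, n_reversals=8):
        """Bekesy-style up-and-down adjustment of the background-noise level.

        The noise is raised in 2-dB steps while the speech remains intelligible
        and lowered when it no longer is; the threshold is taken as the mean
        noise level at the last few reversals.  is_intelligible(speech_dB,
        noise_dB) stands in for the listener's judgement."""
        noise = start_noise_dB
        prev_direction = 0
        reversals = []
        while len(reversals) < n_reversals:
            direction = +1 if is_intelligible(speech_level_dB, noise) else -1
            if prev_direction != 0 and direction != prev_direction:
                reversals.append(noise)
            prev_direction = direction
            noise += direction * step_dB
        noise_at_threshold = sum(reversals[-4:]) / 4.0
        # speech-to-noise ratio at the "just intelligible" point
        return speech_level_dB - noise_at_threshold

    # Hypothetical listener: speech intelligible whenever S/N exceeds -5 dB
    snr = adjust_noise_level(65.0, lambda s, n: s - n > -5.0)
    print(round(snr, 1))   # settles within a step or two of -5 dB
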
The extent of the upward spread of masking for each listener was assessed in a pure
tone masking experiment. In this paradigm, the masker was a 125 msec, ramps included,
gated sinusoid, presented with a linear 25-msec rise-fall time. The masker was pre
sented continuously throughout the listening session with a duty cycle of 0.5. The
signal was a 20-msec tone burst which was shaped with a 10-msec rise-fall time. The
signal was presented in the temporal centre of the gated masker during three successive
masker bursts and deleted every fourth burst. A listening session was devoted to
the measurement of a masked audiogram. The frequency and intensity of the masker
was held constant. The experimental variable was the frequency separation between
signal and masker.
In all experiments, the tonal masker was a 1000 Hz sinusoid, presented at either 75
or 105 dB SPL. Masked thresholds were measured using an adjustment procedure similar
to that described earlier in connection with measurement of speech intelligibility.
The subject was instructed to adjust the level of the probe using a 2-dB-step
attenuator until the probe was at the "threshold" of audibility. Since the intent
of the experiment was to examine the spread of masking into the higher frequencies,
precautions had to be taken to prevent the listener from basing his threshold
measurements on perception of low-frequency combination tones caused by simultaneous
presentation of signal and masker. Thus, at frequency ratios of signal (f_s) to
masker (f_m) between 1.2 and 1.5, an additional low-frequency masking tone at either
2f_m - f_s or f_s - f_m was added. The level of these added tones was at least 20 dB
below the primary masker and therefore played a negligible role in determining the
course of masking.
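The timing of the masking paradigm and the rule for the additional low-frequency tone can be summarised in a short sketch (a reconstruction for illustration only; the function and variable names are ours, and the extra tones are set at exactly 20 dB below the masker although the text only requires "at least 20 dB"):

    def masking_trial_plan(f_masker=1000.0, L_masker=75.0, f_signal=1500.0):
        """Plan one cycle of the pure-tone masking paradigm.

        The 1000-Hz masker is gated on for 125 ms (25-ms linear ramps) with a
        duty cycle of 0.5; a 20-ms probe (10-ms ramps) sits in the temporal
        centre of three successive masker bursts and is omitted in every
        fourth burst.  For signal/masker frequency ratios between 1.2 and 1.5
        an extra low-frequency tone masks the combination tones."""
        masker_on, period = 0.125, 0.250            # seconds, duty cycle 0.5
        bursts = []
        for k in range(4):
            t0 = k * period
            probe = None
            if k < 3:                               # probe deleted every 4th burst
                probe = t0 + masker_on / 2 - 0.010  # 20-ms probe centred in burst
            bursts.append({"masker_onset_s": t0, "probe_onset_s": probe})

        guard_tones = []
        ratio = f_signal / f_masker
        if 1.2 <= ratio <= 1.5:
            for f in (2 * f_masker - f_signal, f_signal - f_masker):
                guard_tones.append({"freq_Hz": f, "level_dB": L_masker - 20.0})
        return bursts, guard_tones
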
Results
In order to facilitate comparison between psychoacoustic and speech-intelligibility
measures of performance, masked and unmasked audiograms are plotted along with the
speech intelligibility functions. For each group are shown a typical subject (data
points) together with total and interquartile ranges across subjects (shaded).
In Fig. 1 are presented the data for listeners having normal hearing (N), as evi
denced by normal pure-tone thresholds throughout the entire audible frequency region
shown in the upper-left panel. The quality of speech intelligibility at various
levels of background noise is given in terms of the speech-to-noise ratio or masked
speech-intelligibility threshold and is depicted in the upper-right panel of each
figure. In general, it can be seen that intelligibility thresholds of about - 5 dB
are obtained at low and moderate noise levels, with a slight increase in threshold
apparent at the higher levels of background.
The pattern of pure-tone masking is depicted in the masked audiograms which are
plotted in the lower panels of Fig. 1. The masked audiogram relates signal threshold
to frequency of the signal presented against a longer-duration 1000 Hz tonal back
ground. As expected, we observe the asymmetry of the masked audiogram which is con
sistent with the standard observation of an upward spread of masking into the higher
frequencies. Comparing the masking patterns obtained at the two masker levels of
75 and 105 dB, we note that the spread of masking becomes more prominent with an
increase in the level of the masker. The present masking results, while in good
qualitative agreement with the classical findings of Wegel and Lane (1924), are, we
feel, noteworthy insofar as they demonstrate that perfectly reasonable masking
results can be obtained using a method of adjustment with listeners having only
minimal experience in psychoacoustical experimentation. In view of the consistency
of the measurements of masking across listeners comprising the control group, it was
deemed appropriate to compute an averaged masked audiogram. The latter serves as a
standard measure of performance against which masking for the hearing-impaired
listeners is compared; it is indicated by the dashed line in the appropriate figures.
Fig. 1. Combined data for listeners with normal hearing. Panel A shows audiograms,
panel B gives speech intelligibility thresholds in noise (as S/N ratio) as a function
of speech level, and panels C and D show masked audiograms for 1 kHz maskers of 75 and
105 dB, respectively. Throughout, data points represent data of one typical subject,
dashed lines averages across subjects (n = 8) and hatched areas indicate interquartile
and total ranges.
In Figures 2, 3 and 4 are plotted the performance functions for listeners classified
as "noise trauma" (T), "presbyacusic" (P) and "ototoxic" (D), respectively. From in
spection of these data several trends emerge, which, we feel, characterize the
listening performance of listeners with selective high-frequency hearing loss due to
sensorineural hearing damage. Most important, we note that according to the Average
Hearing Loss index (as deducible from the audiogram) all of our impaired
listeners would be considered to have near-normal hearing. That the audiograms do
not accurately portray the hearing capacities of these listeners is immediately
obvious from examination of the suprathreshold performance functions for masking
ana speech intelligibility. First, the masked audiograms (solid curve) reveal con
siderably more upward spread of masking than that obtained from normal listeners
(dashed curve). Moreover, this enhanced spread of masking observed for impaired
listeners is manifested in frequency regions where pure-tone thresholds in quiet
are perfectly normal. While the reader can confirm this observation from inspection 15
Fig. 2. Combined data for listeners with noise trauma (n = 7). Lay-out of panels as in
Fig. 1.
of any impaired subject's masked audiogram, let us illustrate the point by referring
to subject JB's performance. In Fig. 2a we observe that in the low-frequency region,
thresholds in quiet are normal, whereas above 2000 Hz a precipitous fall-off in
hearing sensitivity occurs due to acoustic trauma. Concentrating our attention on
the region above the masker at 1000 Hz and below the area of threshold elevation,
it can be seen from the masked audiograms that masking is about 20 dB greater than
is recorded for normal listeners. From examination of the masking patterns of other
listeners with selective high-frequency loss, we observe elevations in masking as
large as 40 dB. Moreover, this enhanced masking effect often occurs in frequency
regions remote from the region of hearing loss where pure-tone thresholds are normal.
The speech intelligibility signal-to-noise thresholds are plotted in panels D. There
is anything from a 5 to 15 dB increase in the speech-to-noise ratio relative to
performance of normal listeners. Moreover, the increase in the S/N is observed
throughout the listener's available range of hearing, thus indicating that the aural
Fig. 3. Combined data for presbyacusic listeners (n = 14). Lay-out of panels as in Fig. 1.
overload attendant to high-level stimulation cannot account for loss of speech in
telligibility experienced by listeners with predominantly high-frequency hearing
loss. A rise in speech threshold seems to be related to a significant upward spread
of masking.
Discussion of results
In summarising the main experimental findings the following results should be
emphasized: (1) for normal hearing adults, the good agreement between the present
data and similar findings in the literature is evidence of the suitability of the
psychophysical procedure of adjustment to measurement of both speech intelligibility
and pure-tone masking for inexperienced listeners; (2) relative to the normal control
group, listeners with selective high-frequency hearing loss show as much as 40 dB
more masking at high frequencies; (3) the increased upward spread of masking ob-
served with impaired listeners is often found in regions of normal pure-tone thres-
Fig. 4. Combined data for two ototoxic listeners. Further lay-out as in Fig. 1.
hold; (4) masked speech-intelligibility thresholds of impaired listeners are in the
order of 5 to 15 dB worse than those of normal observers; (5) whereas measurement of hearing
in terms of the traditional Average Hearing Loss index frequently does not serve
to distinguish members of the experimental and control groups, suprathreshold mea
sures of speech intelligibility and pure-tone masking show essentially no overlap
between the two groups.
Although both the small number of subjects partaking in the present experiment
and the limited stimulus parameter space investigated limit the generality
of the present findings, the potential implications of the results are so far-
ranging that they deserve detailed consideration. Following a discussion of how the
present results bear on our notions of mechanisms underlying auditory masking, more
applied audiological problems of diagnosis and treatment of perceptual handicaps are
considered. In the concluding subsection are presented some final observations on how
the present laboratory findings may influence the reformulation of medico-legal
statutes governing determination of hearing handicap.
The upward spread of masking and sensorineural hearing loss
Investigations directed at measuring pure-tone masking for listeners with sensori
neural hearing loss have not produced a consistent picture of auditory masking in
the impaired ear. For example, De Boer and Bouwmeester (1974), who investigated
masking patterns produced by narrow bands of noise, have demonstrated increased a
mounts of masking above the passband of the masker in ears with cochlear pathology,
presumably attributable to pronounced harmonic distortion. Nelson and Bilger (1974),
on the other hand, have obtained equivalent masking for probes placed at the octave
above the pure-tone masker. Reports in the older clinical literature, unfortunately,
do not help us to understand the discrepancy between the two reports.
In an attempt to resolve the apparently contradictory findings Leshowitz and Lind
strom (1977) investigated the spread of masking in regions of normal and elevated
threshold within the same ear. The shape of the psychoacoustical tuning-curve,
relating the level of tonal masker just sufficient to mask the fixed frequency probe,
served as the measure of frequency resolution. They observed a complete loss of
frequency selectivity in regions of threshold elevation, whereas in the normal
regions of hearing, tuning curves appeared to be similar to those of normal
listeners. In a second experiment, however, they obtained evidence indicating that
in normal regions of hearing there were aspects of auditory processing that had
been altered by the presence of a lesion in a remote region of the cochlea. It is
well known that presentation of two primary tones gives rise to perception of
additional tones, called combination tones, not present in the stimulus. In listeners
with abrupt high-frequency hearing loss, Leshowitz and Lindstrom observed a marked
reduction in the generation of odd-order combination tones although the primary
tones were both presented in regions of presumably normal hearing.
Additional evidence of abnormal auditory processing in seemingly normal regions of
hearing was also obtained in the present experiment. From inspection of the masked
audiogram it has been observed that when both the probe and masker were located in
regions of normal hearing, masking was as much as 30 dB greater than observed in
listeners with completely normal audiograms. We are hard pressed to account for
this observation in the light of the generally accepted view that there is minimal
sensory cell degeneration in regions of normal behaviour threshold. On the assumption
that a near-normal complement of sensory cells exists in regions of normal threshold,
we are forced to conclude that the physiological insult is more subtle than has here
tofore been realized. Investigations of the physiological correlates of the pronounced
masking effect observed in regions of normal threshold in the traumatized ear pre
sent an intriguing and challenging opportunity, one that has relevance not only to
a fundamental understanding of the mechanism of auditory masking, but also to
understanding the speech handicap experienced by listeners with selective high
frequency hearing loss.
Although we cannot offer speculation about the mechanism underlying the pronounced
spread of masking in ears with cochlear pathology, the empirical finding may offer
insight into the basis of the speech communication problem characterizing listeners
with sensorineural impairment.
In earlier attempts to account for the communication loss reported for listeners
with selective high-frequency hearing damage, the threshold dip has been regarded
as selective attenuation of high-frequency information. This, together with the
limited dynamic range (i.e. recruitment) in the region of threshold elevation
has been held responsible for the loss of speech discrimination in noise. This view
is consistent with the stated need for selective amplification and automatic gain
control in hearing aids.
Quite a different explanation of the speech deficit is suggested by the masking
patterns obtained with impaired listeners. As a working hypothesis we suggest that
the pronounced spread of masking into the high-frequency regions, including areas
of normal threshold, attendant on the threshold dip, is a major causative agent for
the listener's communication loss in noise. In order to test this notion, a simple
demonstration experiment was conducted. A typical high-frequency hearing loss was
simulated in a normal listener by presenting a 3000 Hz continuous tone along with
the discourse. As before, the task of the listener was to adjust the level of the
noise until the speech just became intelligible. Addition of the tone was found
to have no effect on speech intelligibility. Thus, we conclude that the loss of
high frequency information in the speech waveform is due to threshold elevation and
that increased recruitment plays only a secondary role in determining the quality
of speech perception. It is clear that detailed measurements of the frequency se
lectivity and auditory nonlinearities attendant on hearing impairment are needed
before we can reach a more fundamental understanding of the hearing disorder.
Assessment of hearing handicap
The hearing handicap, as argued, can be usefully expressed in terms of the difficulty
of hearing and communicating in an everyday situation of ambient noise. A quantitative
index of the individual's performance in such a situation is his speech
intelligibility threshold determined against a background of noise (expressed as
signal-to-noise ratio). Insight into the value of this index is gained by analyzing
the everyday situation of two competing sound sources, one a primary talker and the
other unwanted interference from another speaker or TV, etc. Plomp (1976) has shown
that, under ordinary room acoustics and placement of the two sources, the primary
sound is about 5 dB above the comfort level for understanding of conversation by the
normal-hearing listener. We have seen earlier that listeners with selective high
frequency hearing loss have masked thresholds between 5 and 15 dB higher than the
normal control group. Thus the prediction is that almost all our non-normal experi
mental listeners will have serious, if not insurmountable, difficulties in under
standing the speaker in noise.
The above hearing handicap can be estimated directly by measuring the listener's
masked speech-intelligibility threshold. However, the apparent relationship between
speech discrimination in noise and the prominence of the upward spread of masking
suggests that it may well be possible to predict the handicap from audiometric
measurements of masked threshold, thereby bypassing the difficulties inherent in
speech-intelligibility measurements.
Using the averaged masked audiogram obtained for the 75 dB, 1000 Hz masker for
normal observers as the baseline, we can quantify the degree of the upward spread
of masking in individual listeners by averaging the elevation of masked threshold
at 1500, 2000 and 3000 Hz above the normal masked thresholds. The averaged masked
threshold elevation, which we shall call the Handicap Index (HI), in conjunction
with a simple decision rule, can now be used to predict speech intelligibility.
Fig. 5 is a scattergram depicting masked speech-intelligibility threshold, averaged
across speech levels of 65 - 95 dB, plotted against HI for all the listeners partaking
in the experiment. If we accept for a moment that a 5 dB elevation of intelligibility
threshold gives rise to a significant hearing handicap, we can evaluate various de
cisions. To illustrate the approach, assume that we adopt a rule whereby all listeners
having an HI greater than 10 dB are classified as handicapped. From Fig. 5 we observe
that the probability of correctly identifying a handicapped individual is close to
unity. Unfortunately, a few individuals are incorrectly classified as having a handi
cap. Nevertheless, the outcome is deemed quite respectable, especially when it is
realized that according to the Average Hearing Level measured in the quiet all listeners
would be considered normal. Assessment of hearing handicap with the HI approach, while
very promising, must be substantiated in many additional tests before we can
recommend the procedure as a clinical diagnostic tool.
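By way of illustration, the Handicap Index and the decision rule can be written out in a few lines (a sketch only; the 1500/2000/3000 Hz frequencies, the 75-dB, 1000-Hz masker baseline and the 10-dB criterion come from the text, whereas the function names and the numerical thresholds in the example are hypothetical):

    # Frequencies (Hz) over which the masked-threshold elevation is averaged
    HI_FREQUENCIES = (1500, 2000, 3000)

    def handicap_index(masked_thresholds_dB, normal_masked_thresholds_dB):
        """Average elevation of the listener's masked thresholds (75-dB,
        1000-Hz masker) above the averaged normal masked audiogram, at
        1500, 2000 and 3000 Hz."""
        elevations = [masked_thresholds_dB[f] - normal_masked_thresholds_dB[f]
                      for f in HI_FREQUENCIES]
        return sum(elevations) / len(elevations)

    def classified_as_handicapped(hi_dB, criterion_dB=10.0):
        """Decision rule discussed in the text: HI above 10 dB -> handicapped."""
        return hi_dB > criterion_dB

    # Hypothetical example: masked thresholds in dB SPL for one impaired listener
    normal   = {1500: 52.0, 2000: 40.0, 3000: 30.0}
    listener = {1500: 70.0, 2000: 62.0, 3000: 48.0}
    hi = handicap_index(listener, normal)
    print(round(hi, 1), classified_as_handicapped(hi))   # -> 19.3 True
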
Fig. 5. Scattergram of Handicap Index (i.e., the average "supranormal" upward spread
of masking obtained at 1500, 2000 and 3000 Hz) against masked speech-intelligibility
threshold. Symbols distinguish the listener classes (normal, old normal, trauma,
presbyacusis, drugs, inherited).
Possibilities for rehabilitation
Practically all modern hearing aids are designed to produce either selective or broad-
band amplification of sound. Since wanted speech signal and unwanted background
undergo equal amplification, the signal-to-noise ratio is at best unaltered (in
practice it decreases somewhat due to noise in the instruments). This implies that
listeners with a flat speech intelligibility loss (all our non-normal listeners)
benefit from the traditional hearing aid only in a situation where background is
negligible to start with (Plomp, 1976).
If we are to significantly improve the speech-reception capabilities of our typical
impaired listener, the signal-to-noise ratio must be increased by about 10 dB.
A straightforward approach to solving the hearing dilemma entails detaching the
microphone from the hearing aid. By giving the microphone to the speaker, a consi
derable improvement in the S/N can be realised. From the inverse-square law of
physics we know that in the direct sound field acoustic signal strength decreases
6 dB per doubling of distance. At a comfortable speaker-to-listener distance of
two metres, the loss of signal strength is about 10 dB. Since ambient noise in a
room is about the same level at all locations, bringing the microphone near the
speaker improves the S/N by 10 dB. This 10 dB improvement in the S/N provides the
margin between intelligible and unintelligible speech and is sufficient to enable
most hearing-impaired listeners to understand speech about as well as the unaided
normal individual.
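The figure of roughly 10 dB follows directly from the free-field distance law; a small worked sketch (the 0.6-m microphone distance is our illustrative assumption, not a value given in the text):

    import math

    def level_change_dB(d_listener_m, d_microphone_m):
        """Free-field (inverse-square) gain in speech level obtained by moving
        the pick-up point from the listener's position to a close-talking
        microphone; ambient noise is assumed equally loud everywhere, so this
        is also the S/N improvement."""
        return 20.0 * math.log10(d_listener_m / d_microphone_m)

    # 2 m speaker-to-listener distance; a lapel microphone at roughly 0.6 m
    # (an assumed, illustrative value) recovers about 10 dB of S/N.
    print(round(level_change_dB(2.0, 0.6), 1))   # -> 10.5
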
Present research efforts of the author and others at the Institute for Perception
Research are now being directed at developing a detachable microphone system that
not only delivers the required S/N, but is also ergonomically acceptable to the
user. The detachable microphone system incorporated in the personal hearing aid has
obvious shortcomings; none the less, the concept has received great acceptance as an
auditory trainer in schools for the hard-of-hearing. Borrowing from the technology
of light transmission of TV audio developed by Sennheiser, a prototype for the de
tachable microphone system utilising infrared light transmission of the acoustical
signal has been developed. The system consists of two components: a transmitter,
which includes the microphone; and a receiver comprising an infrared light receiver,
an amplifier and an earphone. A detailed report of the initial evaluation of the
prototype is beyond the scope of this report. However, it can be stated that we are
most encouraged both by the improvements in speech intelligibility rendered by the
device and by the enthusiastic endorsement its users have voiced.
Conclusion: toward a reevaluation of the medico-legal standards of hearing handicap.
It should be abundantly clear that the basic message of the present work is that
evaluation of hearing capacities must take place under realistic conditions of
moderate background noise. That listeners with audiological evaluations of "normal"
hearing based on pure-tone thresholds experience serious difficulties in understanding
speech in noise can hardly be unknown to the practicing audiologist. What is pain
fully obvious is that existing procedures for assessing the hearing handicap still
emphasize measurements taken in quiet and therefore do not capture the daily auditory
experiences of the impaired listener. Hearing handicaps are often subtle: they are
manifested most acutely in noise and are undetectable using standard audiological
techniques. Likewise, legal standards of damage-risk criteria for environmental noise
give carte blanche to insulting agents that do not intrude in the region of the
speech frequencies. In other words, high-frequency hearing is legally expendable.
The very profound deficits we have uncovered for listeners with exclusively high
frequency hearing loss clearly demonstrate how ludicrous is the prevailing view of
hearing handicap.
In conclusion, perhaps the major problem in the impaired-hearing field is to reach
a consensus as to what constitutes hearing handicap. Having established an acceptable
standard for evaluation of auditory capacities we can then proceed with research
directed at developing effective prosthetic devices. And, perhaps more important
still, we can begin devising legal standards governing "safe" levels of environmental
irritants.
"An ounce of prevention is worth a pound of cure:"
Summary
In this paper experiments are described to establish empirical evidence for the
audiological observation that listeners with normal pure-tone thresholds below 2000 Hz
and selective high-frequency sensorineural hearing loss often experience great
difficulty perceiving speech in a noise background. For patients with either noise
trauma or presbyacusis, masked-speech intelligibility thresholds (S/N) were about
10 dB higher than for normal observers. In an effort to provide a psycho-acoustical
explanation for the speech communication deficit, pure-tone masking patterns were
measured. Relative to the normal control group, listeners with high-frequency
hearing loss showed as much as 30 dB more upward spread of masking, often in fre
quency regions of normal pure-tone threshold. The strong positive relationship be
tween the masked-speech intelligibility threshold and the upward spread of masking
suggests that it may be possible to predict the patient's speech perception handicap
in noise from audiometric measurements of masked threshold. Implications of the pre
sent work for development of close-talking-microphone hearing aids are indicated.
References
de Boer, E. and Bouwmeester, J. (1974) Critical bands and sensorineural hearing loss,
    Audiology, 11, p. 236-259.
Courtois, J. (1975) Binaural IROS fitting of hearing aids. Scandinavian Audiology,
    suppl. 5, p. 194-230.
Leshowitz, B.H. and Lindstrom, R. (1977) Measurement of nonlinearities in listeners
    with sensorineural hearing loss, In: E.F. Evans and J.P. Wilson (Eds.)
    Psychophysics and Physiology of Hearing, Academic Press, London, p. 283-292.
Nelson, D.A. and Bilger, R.C. (1974) Pure-tone octave masking in listeners with
    sensorineural hearing loss, J. Speech and Hearing Research, 12, p. 252-269.
Plomp, R. (1976) Binaural and monaural speech intelligibility of connected discourse
    in reverberation as a function of azimuth of a single competing sound source
    (speech or noise), Acustica, 34, p. 200-211.
Plomp, R. (1978) Auditory handicap of hearing impairment and limited benefit of
    hearing aids, submitted for publication.
Wegel, R.L. and Lane, C.E. (1924) The auditory masking of one pure tone by another
    and its probable relation to the dynamics of the inner ear, Phys. Rev., 23,
    p. 266-285.
Further psychophysical data on two-tone suppression
H. Duifhuis, J. Smits, J. v.d. Vorst and M. Scheffers
Introduction
We have previously reported theoretical (Duifhuis, 1976) as well as experimental re
sults on two-tone suppression (Duifhuis, 1977). So far the experimental psychophysi
cal data were limited to two-tone suppression for a suppressee frequency of 1 kHz.
Data have been obtained exclusively with the pulsation threshold technique.
In this paper the following additional results are presented: (1) a comparison between
two-tone suppression in pulsation threshold and in forward masking; (2) two-tone
suppression data at suppressee frequencies other than 1 kHz; (3) the effect of a white
noise background on the suppression effect. These additional data are considered
necessary not only for a better understanding of the general agreements and significant
differences established to date between theoretical predictions and experimental data,
but also for a better understanding of the suppression mechanism.
Two-tone suppression data will be presented here exclusively in the Fig. 1 format.
For fixed suppressee frequency (f_S) and suppressor (or masker) frequency (f_M) it
shows the effect of suppressor level (L_M) on "activity in the suppressee channel" as
monitored by the probe threshold (L_P), where f_P = f_S. Suppressee level L_S is a
parameter. Suppressee S and suppressor M are presented simultaneously, the probe
signal P separately. Part (a) of the curve characterises the situation where the
suppressor is ineffective because L_M is too low. Branch (b) shows increase in
suppression (decrease in response) with increasing L_M. Branch (c) of the suppression
curve represents the situation where the suppressor is so much stronger than the
suppressee that L_P reflects response to the suppressor (M) instead of the suppressee
(S). The depth D of the suppression notch is a quantitative measure of the suppression
effect.

Fig. 1. Schematic result of a two-tone suppression experiment. The ordinate gives the
response L_P to a fixed suppressee (L_S) as a function of suppressor level (L_M). The
curve is typical for f_M < f_P.
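The depth D can be extracted numerically from a measured suppression curve; a minimal sketch (our own illustration, not part of the original analysis, with hypothetical data) could be:

    def suppression_depth(L_M, L_P):
        """Depth D of the suppression notch, in dB.

        L_M: suppressor levels (dB SPL); L_P: corresponding probe thresholds.
        Branch (a) of the curve is approximated by the response at the lowest
        suppressor levels; D is the drop from that plateau to the minimum of
        the notch."""
        plateau = sum(L_P[:3]) / 3.0        # average over branch (a)
        return plateau - min(L_P)

    # Hypothetical curve: plateau near 40 dB, notch down to 28 dB -> D = 12 dB
    levels = [50, 55, 60, 65, 70, 75, 80, 85, 90]
    probe  = [40, 40, 40, 38, 33, 28, 30, 36, 44]
    print(suppression_depth(levels, probe))   # -> 12.0
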
Pulsation threshold vs forward masking
The comparison between the pulsation threshold and forward masking method in the
two-tone suppression case was primarily motivated by the considerable day-to-day
variability which was found in parts (b) and (c) of the suppression curves (Fig. 1).
Within subjects, threshold settings cover ranges up to 15 dB. Fortunately, the sup-
pression effect measured was often twice as large. Nevertheless we wondered whether
the variability was inherent in the pulsation threshold method, and whether more
stable results could be obtained by another method. Subjects did report some dif
ficulties in keeping the criterion in the adjustment task constant. Therefore we
decided to compare with results from a two-interval, two-alternative forced-choice
experiment. Since this paradigm cannot be used for determining a pulsation threshold,
it was applied in a forward masking experiment. This choice also helps to relate our
pulsation threshold results, e.g., to Shannon's (1976) forward masking data on two-
tone suppression.
Our implementation of the pulsation threshold technique has been described in
Duifhuis (1977). In the forward masking experiment we used masker durations of 400 ms
separated by an 800-ms silent interval. Probe duration was 20 ms at half amplitude,
and all signals had cosine-shaped ramps of 20 ms (Fig. 2). Probe onset started
immediately at the end of the masker offset ramp. The subject used the sequential
up-and-down strategy described by Cardozo (1966). Typically, 40 to 100 trials were
presented for the determination of one 75% correct threshold. Measurements were
repeated over the course of several weeks. Figures 3 and 4 show results of subject JS
for pulsation threshold and forward masking, respectively. In both cases
f_S = f_P = 1 kHz, f_M = 400 Hz, and L_S = 45 dB SPL were used. In one session the
subject measured a series of points by one method and the same series again by the
other method. Data from one session are represented by a single symbol in the two
figures.

Figure 3 shows results similar to those obtained previously. For L_M < 70 dB the
estimated standard deviation s = 1 dB (branch a). With increasing L_M the values of s
increase to about 5 dB.

Fig. 2. Time course of the forward masking stimulus. Ramps are cosine-shaped.
(M: suppressor, S: suppressee, P: probe.)

Fig. 3. Ten series of pulsation threshold data (two-tone suppression experiment)
obtained over a 2-month period. Parameters: f_M = 400 Hz; f_P = 1 kHz; L_S = 45 dB SPL.

Fig. 4. Data of the forward masking experiment corresponding to Fig. 3.
Figure 4 gives the corresponding forward masking results. Note that the variability
here is independent of L_M; over the entire range we find s of about 5 dB. At
L_M = 85 dB we found a significant correlation (r = 0.8) between forward masking and
pulsation thresholds obtained within a single session. Lack of correlation at other
levels, due partly to the fact that variability in pulsation thresholds was too small,
prevents far-reaching conclusions from being made. At any rate, the forced-choice,
forward-masking method did not produce results that were superior to pulsation
threshold results insofar as day-to-day variability was concerned. Finally, it may be
worthwhile mentioning that further measurements indicated that variability could be
reduced by shortening measurement series to sessions of, at most, 15 min.
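The sequential up-and-down strategy mentioned above can be illustrated with a generic transformed up-down staircase for a two-interval forced-choice task (a sketch only: the 2-down/1-up rule shown tracks roughly the 71-75% correct region but is not necessarily Cardozo's (1966) exact rule, and the listener model prob_correct is hypothetical):

    import random

    def two_afc_staircase(prob_correct, start_dB=50.0, step_dB=2.0, n_trials=80):
        """Generic transformed up-down (2-down/1-up) staircase for a
        two-interval forced-choice task: the probe level is lowered after two
        consecutive correct responses and raised after every error, so the
        track hovers around the 71-75% correct region."""
        level = start_dB
        run = 0                  # consecutive correct responses
        last_move = 0            # -1 down, +1 up, 0 not yet moved
        reversals = []
        for _ in range(n_trials):
            if random.random() < prob_correct(level):
                run += 1
                if run < 2:
                    continue     # wait for the second correct response
                move = -1        # two correct in a row: lower the level
            else:
                move = +1        # error: raise the level
            run = 0
            if last_move and move != last_move:
                reversals.append(level)
            last_move = move
            level += move * step_dB
        return sum(reversals[-6:]) / 6 if len(reversals) >= 6 else level
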
Two-tone suppression at other frequencies
Additional two-tone suppression data were obtained with the pulsation threshold
technique. The experimental set-up was essentially the same as the one previously
used, except that available equipment now produces trapezoidal envelopes instead of
envelopes with cosine-shaped ramps. Onset and offset were adjusted to 25 ms.

Some results are shown in Fig. 5 for f_s = 0.5, 1, 2 and 4 kHz. Qualitatively the
results are very similar for all suppressor frequencies. Two-tone suppression depends
strongly on the ratio f_m/f_s, but weakly on absolute frequency. A close look at the
data might suggest a decrease in suppression at higher frequencies. The results shown
here are consistent with comparable data (Abbas and Sachs, 1976; Shannon, 1976) and
theoretical expectations.

Fig. 5. Two-tone suppression data at several suppressee frequencies f_s. In part A are
shown examples for f_m < f_p, in part B for f_m > f_p (legend: f_p/f_m = 500/200,
1000/600, 2000/1200 and 4000/2400 Hz in part A; 500/600 and 2000/2400 Hz in part B).
All data shown in this figure used L_s = 60 dB SPL.
Two-tone suppression against a continuous noise background
The effect of a continuous noise background on two-tone suppression was investigated
in a number of conditions where a relatively strong suppression effect had been
measured. The only change in the experimental set-up consisted in the addition of a
continuous white noise signal to the headphone. The noise was presented at 3 spec
tral densities differing by 10 dB.
Fig. 6. Two-tone suppression data at f_s = 2 kHz, f_m = 0.8 kHz and L_s = 50 dB SPL,
for several continuous white noise backgrounds (no noise, -2, 8 and 18 dB/Hz). The
parameter is the spectral density of the noise in dB/Hz.
Fig. 6 shows an example of the results. The major effect of the noise is to fill up
the suppression notch and decrease the suppression effect. A 20 dB increase in
spectral density level N_0 suffices to reduce suppression from maximum to zero effect.
Reduction of suppression by noise addition can be interpreted in terms of wide-band
noise acting as a suppressor (cf. Houtgast, 1974; Duifhuis and Simons, 1976;
Leshowitz and Lindstrom, 1977). The background noise suppresses the suppressee, and if
maximum suppression is obtained in this way then addition of the tonal suppressor
cannot amplify the suppression effect. Since probe and suppressee are affected in the
same way, the suppression by the noise background does not show up in a downward shift
of branch (a) in the suppression curve (Fig. 1). A secondary effect is that the noise
background also suppresses the masker. This effect by itself would cause a local shift
of the suppression curve along the L_M-axis. The effect depends on the L_M/N_0 ratio
and will disappear at sufficiently high masker levels. The suppression curves for
N_0 = -2 dB/Hz and N_0 = 8 dB/Hz provide some evidence for this effect (levels in SPL).
The present results support the assumption that a simultaneous wideband noise masker
is an effective suppressor.
Conclusion
Day-to-day variability in pulsation threshold is of the same size as in forward masking
using a two-interval, two-alternative forced-choice method.
Psychophysical two-tone suppression depends predominantly on the frequency ratio of
suppressee to suppressor, given the amplitudes.
In a background of continuous white noise the two-tone suppression effect seemingly
disappears.
Qualitatively, these findings are in agreement with our theoretical expectations. A
quantitative description of all data requires more theoretical and experimental work.
References
Abbas, P.J. and Sachs, M.B. (1976) Two-tone suppression in auditory-nerve fibers:
    Extension of a stimulus-response relationship, J. Acoust. Soc. Amer., 59,
    p. 112-122.
Cardozo, B.L. (1966) A sequential up-and-down method, I.P.O. Annual Progress Report,
    1, p. 110-114.
Duifhuis, H. (1976) Cochlear nonlinearity and second filter. Possible mechanism and
    implications, J. Acoust. Soc. Amer., 59, p. 408-423.
Duifhuis, H. and Simons, W.F. (1976) Theoretical responses of the "hair-cell BPNL"
    model to bands of noise, I.P.O. Annual Progress Report, 11, p. 2-9.
Duifhuis, H. (1977) Cochlear nonlinearity and second filter. A psychophysical
    evaluation, In: E.F. Evans and J.P. Wilson (Eds.), Psychophysics and Physiology
    of Hearing, Academic Press, London, p. 153-163.
Houtgast, T. (1974) Lateral suppression in hearing. Doctoral thesis, Free University,
    Amsterdam.
Leshowitz, B. and Lindstrom, R. (1977) Measurement of nonlinearities in listeners with
    sensorineural hearing loss, In: E.F. Evans and J.P. Wilson (Eds.), Psychophysics
    and Physiology of Hearing, Academic Press, London, p. 283-292.
Shannon, R.V. (1976) Two-tone unmasking and suppression in a forward-masking situation,
    J. Acoust. Soc. Amer., 59, p. 1460-1470.
Preliminary experiments on accent perception in tone sequences
J. Thomassen
Introduction
When listening to a sequence of tones, some tones seem to be more prominent than
others and are said to have accent. This description, in spite of its vagueness,
can be turned into an operational definition of accent. Accent, then, is to be
considered as a concept in the perceptual domain. Note that it can be described
without making use of physical properties of the tone sequence.
In the physical domain, certain factors may elicit the perception of accent. The
mapping of the physical domain onto the perceptual domain is the main purpose of the
investigation, which started just over a year ago. It will be useful to introduce
separate terms in the physical domain. Hence accentuation is introduced as a physical property of a sound which gives the impression of being an accent.
Restricting ourselves to pure tones, three physical properties stand out.
1. A tone that has a higher loudness than its neighbours is said to have dynamic accentuation.
2. Temporal accentuation is understood to be the set of operations in the time
domain that result in the perception of accent.
3. Melodic accentuation is the accentuation given by the succession of frequency
intervals of the sequence.
The three accentuations will, of course, interact and non-physical factors like
memory, expectation etc. are known to affect accent(-perception) as well.
An operational definition of accent is closely related to a method of measurement. The first problem to be solved is thus to find a reliable and efficient method of measurement that does justice to the common notion of the concept of accent. Apart from that,
it must be ensured that the definition is not dependent on the particular method used
and therefore a number of methods has to be compared. It should then turn out that
in a given sequence the same tones are considered as accents with the various methods.
The methods can be profitably tested using stimuli with dynamic accentuation only,
as the mapping accentuation-accent seems rather obvious in that case.
Having found (a) method(s) of measurement that can be extended to the other parameters, a measuring programme will be carried out by one particular method that has been established as most suitable. In this programme it is hoped to find a relationship between accent perception in music and accent perception in speech. The most powerful cue to the perception of accent in speech appears to be a rapid pitch rise, pitch fall or combination of the two ('t Hart and Cohen, 1973). It has also been shown that the timing of pitch movement with respect to vowel onset and end is particularly important, as is its position in the overall pitch contour (Van Katwijk, 1974). Therefore the investigations will be mainly directed towards the contribution of melodic accentuation to the perception of accent in music.
Experiment 1
The simple method we started with involves indicating the position and strength of perceived accentuation in a sequence of tones on a response form.
The stimuli were short isochronous sequences of N (N=4, 5 or 6) 1 kHz tones (55 dB SL) with accentuation produced only by small differences in sound level. In each sequence no more than one tone was accentuated. The position (n, n=1, 2 ... N) of the accentuated tone and the strength of the accentuation (S, S=1, 2, 3 or 4 dB) were varied over all possibilities.
The tone sequences were generated on the MARIE set-up (Moonen, 1975) connected
to a P9202 mini-computer and a HP3320B frequency synthesiser. The recorded stimuli
were presented to 10 subjects diotically, using headphones in a sound-proof booth.
Averaging the number of incorrectly localised accentuations, as well as the number of accentuations shifted one position, and dividing by the number of stimuli, we obtain the quantities C(S,N) and C1(S,N). Combining the results over N=4, 5, 6 and averaging over the 10 participating subjects gives Fig. 1, in which C(S,N) and C1(S,N) are plotted with their standard deviations.
A simple model may clarify the observed data. Having a detection probability of accentuation, D(S), we are left with the probability of making a localisation error in case of detection of accentuation, E(N), and eventually the probability of shifting one position, E1(N), averaged over all positions in the sequence. For a gambling subject (stimulus not detected) the last two probabilities reduce to P(N) = (N-2)/N and P1(N) = 2(N-1)/N^2, respectively. We can now write down the following equations for the probabilities:

    C(S,N)  = D(S)·E(N)  + (1 - D(S))·P(N)      (1)
    C1(S,N) = D(S)·E1(N) + (1 - D(S))·P1(N)     (2)

Putting E(N) ≈ E1(N) (most localisation errors will be one position out), D(S) is obtained by subtracting (2) from (1). An estimate of E1(N) can be made by taking C1(4,N) ≈ E1(N), because we have D(S) ≈ 1 for S = 4 dB.
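The subtraction can be made concrete in a few lines of code. The following sketch (in Python; the numerical example is invented, not data from the experiment) recovers D(S) from measured fractions C and C1 under the approximation E(N) ≈ E1(N):

    # Subtracting (2) from (1) with E(N) ~ E1(N) eliminates the error term:
    #   C - C1 = (1 - D) * (P - P1)   =>   D = 1 - (C - C1) / (P - P1)

    def chance_levels(N):
        """Chance probabilities for a gambling subject in a sequence of N tones."""
        P = (N - 2) / N             # incorrect localisation, as in the text
        P1 = 2 * (N - 1) / N**2     # localisation exactly one position out
        return P, P1

    def detection_probability(C, C1, N):
        """Estimate D(S) from the measured fractions C(S,N) and C1(S,N)."""
        P, P1 = chance_levels(N)
        return 1.0 - (C - C1) / (P - P1)

    # Invented data point: N = 5 tones, C = 0.30, C1 = 0.20.
    print(detection_probability(0.30, 0.20, 5))   # ~0.64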
The fitting of dotted lines and data points confirms that the approximation E(N) ≈ E1(N) was a fairly good one. In Fig. 1 the dotted lines connect the points C(S,N) and C1(S,N) that have been recalculated by substituting the values of D(S) and E1(N) in equations (1) and (2).
We see that with a level difference of 2 dB the 10 subjects could already achieve more than 80% correct responses in localising the accentuated tone. However, some subjects still made errors E(N) at a level difference of 4 dB. The incorrect responses were mostly one position out (E(N) ≈ E1(N)) and were thus termed "counting errors".
Fig. 1. Fraction of incorrectly localised accentuations C and the fraction of accentuations localised one position out C1 as a function of strength of accentuation S; P and P1 are the corresponding chance levels. Average over 10 subjects.
errors". Re-inspection of the data showed that certain subjects were responsible
for the "counting errors". In future such subjects may be discarded through adoption
of an E1(N)- selection criterion.
So far this simple method has been found suitable for measuring accent perception, apart from some efficiency improvements to be made. However, a drawback could be
the tendency of subjects to respond at the first tone of the short sequence in the
case of slight accentuation. Working with short fragments of melodies ("motifs"),
where ambiguity is perhaps the rule rather than the exception, this might distort
the possible outcome of the experiments. A solution to this problem could be the
insertion of the motif in a context, as in the next experiment.
Experiment 2
This experiment was arrived at from the following considerations. The accentuation present in the context in which a motif is embedded influences the accent perception within the motif. Here we restrict ourselves to contexts with a periodic accent structure or metre. A periodic accent structure, once established, tends to be continued in the mind of the listener: forthcoming accents are anticipated at distances determined by the period. This is independent of the real occurrence of accents in the anticipated material. Suppose a motif is embedded in a context with periodic accentuation. The anticipated accent of the motif is thus forced into a position determined by the period of the context. This can be done successively for all positions in the motif, and each time the corresponding distribution of accent responses over the motif tones can be measured. By mutual comparison of these distributions we are able to decide which motif tone lends itself best to accent apart from anticipation; next we can infer from the physical parameters of all the motif tones the (combination of) physical factors which can be conceived as accentuation.
The distributions of accent responses are measured indirectly by asking to what extent the expected accent coincides with the perceived accent in the motif. As a criterion for judgement we take the answer to: "How well does the accentuation of the sequence continue in the motif?"
It is assumed that a subject can handle this criterion well in a situation in which he compares a pair of tone sequences and indicates which sequence best meets the criterion.
This task was allocated to 10 subjects. In addition, 6 of them marked how well each tone sequence of a pair met the criterion on a ten-point scale (10 corresponding to the optimum). The stimuli were isochronous sequences of 1 kHz tones, the difference in sound level between accentuated and unaccentuated tones being 4 dB. The sequences used are shown in Fig. 2.
The motif tones are indicated by tildes, the arrows pointing to the anticipated accents. There were two conditions: in the first the motif tones could be recognized by their smaller tone durations, in the second the motif tones and the context tones could not be distinguished. All possible pairs of the five sequences were constructed and presented four times in each condition.
Fig. 2. Different ways of embedding a three-tone motif in a context with triple or duple metre. The sequences are isochronous, the overall tone frequency is 1 kHz. The motif tones are indicated by tildes; the arrows point at the anticipated accents.
Fig. 3. Agreement of perception of accent in context and motif for the sequences in Fig. 2. Results for scaling (~) and absolute pair comparison (0).
The order within the pairs was balanced and the presentation of the pairs was random. The stimuli were generated with the same set-up as in experiment 1. The data showed no difference between the 2 conditions and no effect of order within the pairs. The scaling results averaged over the 6 subjects and 2 conditions (~) are plotted in Fig. 3, together with the results of the absolute pair comparison (averaged over the 10 subjects and 2 conditions) transformed to the same scale (0).
The sequence A is scaled significantly higher than the sequences C, E and B. This was to be expected, because in A the accentuation coincides with the anticipated accent. The sequence D is scaled somewhat lower than A, because there another accent is expected in a position where there is no accentuation. The scaling results do not differ very much from the results of absolute pair comparison, apart from the small but important difference that with scaling a clearer distinction can be made between sequence D and the low-rated sequences. For this reason scaling is perhaps to be preferred to absolute pair comparison, the more so as scaling can be applied quite apart from a pair comparison task, which may save a considerable amount of time.
Used in this way the method can
provide the possibility of measuring
accent because it yields unequivocally
the cases of coinciding accentuation
and anticipated accent.
At the same time two drawbacks of the method used in the first experiment are removed. The first is the bias towards the first tone (to be considered as a limiting case of anticipation, viz. the case of no context). The second is a direct accent response on the part of the subject: it seems more favourable for the subject to respond indirectly, i.e. by interpreting the sequence rhythmically, instead of applying explicit criteria as to what tones are to be considered as accents.
Summary
Two preliminary experiments, which may become a bridgehead in the measurement of accent perception in tone sequences, have been described. A third method of measurement - viz. a method using tap responses - has been left open. The methods discussed so far have yielded no contradictions with respect to the demands we have made on the implementation of an operational definition of accent. This is not very surprising, for dynamic accentuation leaves little room for such contradictions. However, there seems to be reason enough to assume the feasibility of measuring accent perception in tone sequences with arbitrary accentuation parameters.
References
Hart, J. 't and Cohen, A. (1973) Intonation by rule: a perceptual quest, Journal of Phonetics, 1, p. 309-327.
Katwijk, A.F.V. van (1974) Accentuation in Dutch, Van Gorcum B.V., Assen, Holland.
Moonen, G.J. (1975) MARIE, een Modulaire Aanpassing tussen een Rekenmachine Interface en een Experiment, IPO Report no. 267.
Thomassen, J. (1976) Waarneming van geringe dynamische accentuering in toonreeksen, IPO Report no. 300.
Estimation of annoyance due to low-level sound
B.L. Cardozo and K.G. van der Veen
Introduction
Loudness, noisiness and annoyance are subjective attributes of sound. Loudness, i.e. the subjective correlate of sound intensity, can be assessed in a small number of subjects because there is, normally, fair agreement among them. Their responses can be accurately predicted on the basis of physical measurement (cf. e.g. Zwicker and Feldtkeller, 1967).
For the assessment of noisiness, i.e. the unwantedness of a sound with particular reference to its intensity, many subjects are needed in order to average out personal opinions. Contrary to loudness, there are as yet no general algorithms for predicting the magnitude of noisiness on the basis of physical measurement (cf. Scharf, 1974). The present paper will not deal with loudness or with noisiness but will concentrate on annoyance.
Annoyance is, like beauty for instance, hard to define precisely, but can be considered as the unwantedness of a sound in general. It therefore depends on personal preferences but, in addition, the environment, both acoustic and otherwise, must be taken into account. It is good practice to avoid these problems by presenting a limited set of not too different sounds to a fair number of listeners who are instructed to give an annoyance rating or something equivalent. Theoretically, one would have to use a rich set of sounds, representative of what the population is normally exposed to: traffic sounds, music, building noise, etc. This set should then be administered to an adequate sample of the population and their reactions in terms of annoyance noted.
Sound character
Assume now that the above theoretical annoyance data are plotted against a great many physical parameters, measured for every sound in the set. We would then construct a multidimensional space, some of the axes representing the sound levels in various frequency bands, others representing their time derivatives, still other axes giving the total duration, an objective physical estimate of pitchiness, etc. The set of sounds is represented as a set of points in this multidimensional "annoyance space". These points will not be distributed randomly over the space. In fact, many investigators have found annoyance to correlate highly with sound level, cf. Botsford (1969). Therefore the number of dimensions of the annoyance space can be reduced by projecting all intensity dimensions onto one new axis, labelled L_A, without seriously affecting the original configuration of points. The choice of the A-weighting factors for this projection instead of more sophisticated weightings is not essential. We are now left with a space of lower dimensionality in which two axes are: annoyance and L_A. We now collapse all other dimensions into one that gives a maximal correlation with annoyance, and that is orthogonal to the annoyance axis and the L_A axis. We propose to call this third axis the sound character. In brief, the concept of "sound character" is introduced as the weighted combination of physical
properties affecting the annoyance of a sound, with the exception of the A-weighted level.
In the above picture two problems have been omitted. First, no mention has been made of the environmental "dimensions" and, secondly, it has been tacitly suggested that the annoyance space is linear. However, restricting the discussion to one type of environment and limiting the set of sounds to a small region in the annoyance space, it seems legitimate to consider a concept of "sound character", even though it is a local one.
Annoyance of low-level sounds
The view is advanced that the contribution of sound character to annoyance is relatively more important at a low level than at a high one. Indeed, at extremely high levels the annoyance is just pain, no matter what the sound character is.
At moderately high levels, e.g. 70-100 dB(A), the literature is not unequivocal on the contribution of sound character to annoyance. There are two kinds of papers. The empirical ones, correlating L_A to community reactions, seem to indicate that L_A does do the job of gauging annoyance (cf. ISO R 1996). This does not, however, disprove that the sound character is important. A second class of papers is based on laboratory experiments and maintains the view that noisiness and annoyance cannot be described adequately by the sound level alone.
Berglund et al. (1976) conclude that certain types of noise (jackhammer) are generally considered more noisy than loud, and the more so the lower the sound level. Otherwise stated, the sound character is important, especially at a low level.
we"UZzed;:~\a::Z r------....,Zc(
Fig. 1. Part of the annoyance space. The drawing is merely meant to illustrate the assumed effect of sound character on the annoyance.
Izumi (1977) investigated amplitude-modulated pink noise with an equivalent level of about 70 dB(A). This stimulus proved to be noisier than loud by the equivalent of up to 10 dB(A).
Klaassen (1971) presented a synthetic, complex, broadband noise of about 55 dB(A) to some hundred listeners. In comparing steady state with 100% amplitude-modulated and with a 6% frequency-modulated version (both with a modulation frequency of 3 Hz), he found the annoyance due to A.M. (F.M.) to be the equivalent of 8 (7) dB.
It is difficult to find studies on annoyance at still lower sound levels. A paper by Viebrock et al. (1975) deals with direct assessments of the loudness of electric clocks that produce 20 to 40 dB(A). The paper is relevant insofar as their subjects comment on what we have termed the "sound character". We therefore think that it is interesting to study annoyance of sounds with a relatively low level in order to see whether factors other than the sound level L_A must be taken into account. Fig. 1 is meant to summarize the conceptual situation.
With the above considerations in mind, a pilot study was made of the sound of the household refrigerator. It is our conviction that annoyance studies should deal with common sounds, known to the subjects.
The refrigerator sound satisfies this condition. It has, moreover, a low level. Finally, although the refrigerator is possibly the most silent of household machines, there is an increasing number of complaints about its noise, probably due to the growing number of open kitchens.
Listening experiment
In order to get an idea of the contribution of the sound character to the annoyance
caused by refrigerator sound, a listening experiment was performed with 15 subjects.
Every subject was presented with 56 pairs of sounds through headphones at a level
of about 50 dB(A). He had to tell whether the first or the second member of the pair
was the more annoying sound. Each sound lasted 3.5 s, there was a pause of 1.5 s
between the members of the pair and each pair was followed by a response pause of
4 s. The 56 pairs covered all possible pairs of 8 test sounds, twins excepted but
reversals of order included.
The subject was instructed to judge the sounds as if he were exposed to them while
relaxing at home. The test sounds are given in table I.
Description of test sound                                   Code   Equivalent level
                                                                   Natural   Equalized
                                                                   dB(A)     dB(A)
Normal sound of refrigerator in CONtinuous operation        CON      39        39
RUMbling version of CON, waxing and waning                  RUM      52        41
Normal ONset of refrigerator                                NON      47        39
Normal OFfset of refrigerator                               NOF      39        40
Processed ONset with "improved" character                   PON      40        40
Processed OFfset with "improved" character                  POF      33        44
White NOIse, reference signal                               NOI      41        40
CON, amplified by about 16 dB, serving to gauge the scale   COA      56        55

Table I.
The test sounds were used in two similar experiments. In the first experiment the sounds were presented at the natural levels (save NOI and COA), but in the second the sound levels were changed so as to make them more or less equal (COA and POF excepted).
Natural levels

        POF   NOI   CON   PON   NOF   RUM   NON   COA
POF     ---    15    23    23    26    27    28    30
NOI      15   ---    25    20    24    28    22    26
CON       7'    5'  ---    15    19    27    28    27
PON       7'   10'   15   ---    19    23    24    23
NOF       4'    6'   11    11   ---    24    23    27
RUM       3'    2'    3'    7'    6'  ---    18    27
NON       2'    8'    2'    6'    7'   12   ---    24
COA       0'    4'    3'    7'    3'    3'    6'  ---
Σ        38    50    82    90   104   144   149   184
a        18    24    39    43    50    69    71    88

Equalized levels

        CON   NOI   RUM   NOF   POF   NON   PON   COA
CON     ---    11    20    17    19    24    26    29
NOI      19   ---    17    17    16    17    18    25
RUM      10'   13   ---    16    15    19    25    30
NOF      13    13    14   ---    13    13    16    29
POF      11    14    15    17   ---    15    13    24
NON       6'   13    11    17    15   ---    14    30
PON       4'   12     5'   14    17    16   ---    28
COA       1'    5'    0'    1'    6'    0'    2'  ---
Σ        64    81    82    99   101   104   114   195
a        30    39    39    47    48    50    54    93

Tables II and III. Voting tables for the experiments with natural (II) and equalized (III) levels. Each entry gives the number of times that the sound above the column was voted more annoying than the sound to the left of the row. Apostrophes indicate ratios significantly different from 15/15 (P<0.05). a is the normalised annoyance measure.
The results are shown as 'voting' tables II and III, in which the sounds have been arranged in order of increasing annoyance. For example, in table II RUM was voted 24 times to be more annoying than NOF, whereas NOF was only 6 times judged to be more annoying than RUM. This ratio is significantly different from 15/15 at a level lower than 1%. The theoretical maximum sum of annoyance votes for one test sound, 210, is used as a divisor to obtain the normalised annoyance a/100. Thus, a ranges from 0 to 100.
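The step from voting table to annoyance measure is simple enough to state as a sketch (in Python; the three-sound matrix is an invented example, not the data of Tables II and III):

    # votes[i][j] = number of times sound j was voted more annoying than sound i.
    # With 15 subjects and both presentation orders, each off-diagonal cell can
    # be at most 30, so in the 8-sound experiment a column can sum to 7 * 30 = 210.

    def annoyance_scores(votes, max_votes_per_pair=30):
        n = len(votes)
        max_sum = (n - 1) * max_votes_per_pair
        sums = [sum(votes[i][j] for i in range(n) if i != j) for j in range(n)]
        a = [round(100 * s / max_sum) for s in sums]
        return sums, a

    # Invented 3-sound example (diagonal unused):
    votes = [[0, 10, 25],
             [5,  0, 20],
             [2,  8,  0]]
    print(annoyance_scores(votes))   # column sums and normalised annoyance a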
Figs. 2 and 3. Annoyance rating a of test sounds as a function of L_A. Linear regression lines a = 2.9·L_A - 75 and a = 3.2·L_A - 84, respectively, are heavy lines. Thin lines with arrows connect similar sounds before and after processing (NON, PON; NOF, POF) or amplification (CON, COA).
Figures 2 and 3 present the 'natural' and the 'equalized' experiments respectively in an L_A-a diagram. Regression lines are drawn that minimize the sum of squared a-deviations. From these figures one can draw the following conclusions.
Results
1. The A-weighted sound level L_A has a preponderant influence upon the annoyance of the test sounds. One dB(A) corresponds to roughly 3 points on the centesimal annoyance scale.
2. As a rule, the continuous sounds turn out to be less annoying than the onsets and offsets of equal level. For the normal, unprocessed sounds the difference corresponds to about 3 dB(A). It is likely that this result is an underestimation, since the effect of startle is probably more important in real life than in the laboratory situation.
3. Processing the onset and offset sounds does improve their character. The average effect is equivalent to 3 dB(A). This may be an underestimation for the reason mentioned above.
A general remark is justified as to the feasibility of this type of experiment. Although the subjects listened with headphones in a soundproofed booth to sounds as short as 3.5 s, they were fairly consistent in their responses. Inconsistencies centred around the white noise, which was considered by some of the subjects to be a very alarming refrigerator sound.
Conclusion: annoyance ratings of a rather low-level sound such as that produced by a refrigerator have been shown to be mainly dependent upon the A-weighted level, but to a minor extent also on the sound character.
Acknowledgement
The assistance of G. Doodeman in preparing the stimulus tapes is gratefully recorded.
Summary
The concept of sound character is introduced as a physical attribute responsible for any systematic differences in annoyance due to different sounds at the same A-weighted sound level. It is thought that this sound character is more important at low sound levels than at high ones. A pilot experiment with refrigerator sounds does indicate a clear, though slight, effect of sound character. Especially the sharp onsets are shown to worsen the character of the sounds in question.
References
Berglund, B., Berglund, U. and Lindvall, U. (1976) Scaling loudness, noisiness and annoyance of community noises, J. Acoust. Soc. Am., 60, p. 1119-1125.
Botsford, J.H. (1969) Using sound levels to gauge human response to noise, Sound and Vibration, 3, p. 16-28.
ISO R 1996, Assessment of noise with respect to community response, 1st ed. 1971. Obtainable through the National Standards Organisation.
Izumi, K. (1977) Two experiments on the perceived noisiness of periodically intermittent sounds, Noise Control Engineering, p. 16-23.
Klaassen, J.A. (1971) Fluctuations of a background noise add to its annoyance, in: P. Zonderland (Ed.), Noise 2000, proc. of congresses 5 and 6 of A.I.C.B., Groningen, Wolters-Noordhoff Press, p. 199-200.
Scharf, B. (1974) Loudness and noisiness - same or different? Internoise 74, Proceedings of the 1974 International Conference on Noise Control Engineering held in Washington D.C., New York: Noise/News, p. 559-564.
Viebrock, W.M., Crocker, M.J. and Cooper, W.A. (1975) Loudness evaluations of electric clock noise, Appl. Acoust., 8, p. 193-201.
Zwicker, E. and Feldtkeller, R. (1967) Das Ohr als Nachrichtenempfänger, 2nd ed., S. Hirzel Verlag, Stuttgart, p. 184-203.
Speech
An experimental system for man-machine communication by means of speech
H.F. Muller, S.G. Nooteboom and L.F. Willems
Introduction
In many communication situations, speech is the fastest, most natural, and most flexible medium for the exchange of factual information between people. For computers this may not be true. It seems reasonable, however, when we consider possible ways of man-computer interaction, to ask whether we can make computers speak and understand speech. Studying this possibility is the more urgent as the number of people dealing with computers in their daily life is rapidly increasing and will soon include not only professionals but also the general public.
Advanced techniques for computer voice read-out of stored information and automatic recognition of spoken commands have been applied in the laboratory to man-machine communication by voice, for instance automatic booking of travel reservations (Flanagan, 1976). More ambitious attempts, aiming at applications in a more distant future and involving automatic recognition and understanding of whole sentences, have been made within the ARPA Speech Understanding Project (Klatt, 1977).
In our institute we have recently set up a research project, the main purpose of
which was to consider the possibilities and problems in communication by voice
between a computer system and many users, given a rather simple and not error-free
word recogniser. In this project we have set ourselves the task of making a limited
computer information service in the Dutch language capable of giving the departure
times of intercity trains from Eindhoven railway station in four different directions.
Potential users would be the 40 or so male employees of our institute. The system
was completed within nine months as had been agreed before starting the project.
For a computer to carry on an informative conversation with a human it has to have such facilities as a speech recogniser, a voice response unit, a data base concerning the topic of conversation, and a set of strategies by which it knows what to do with the incoming information and what to say when.
We estimated that technologically the speech recogniser was the most difficult part to realise. We thought it feasible to build a speech recogniser recognising isolated words from very limited vocabularies, accepting speech from about 40 male speakers, and achieving a reasonable recognition score. From this followed the main philosophy of the project: to restrict, inconspicuously, the messages spoken by the user to the machine to isolated words from very limited vocabularies, by implementing a dialogue structure in the form of carefully chosen questions, in which the system retains the initiative.
Below we will briefly describe the resulting system under the headings of word recogniser, linguistic form of the dialogue, voice output, and system control structure. Each heading will be followed by the names of the members of our institute who contributed part of their time to that component of the system. Finally we will present
some tentative conclusions.
Word recogniser (Muller, Dobek, van Nes)
Requirements on the word recogniser were:
a. easy to build
b. operating in real time
c. suitable for recognition of isolated words from limited vocabularies of 2 to 9
Dutch words
d. accepting speech from about 40 male speakers
e. recognition not error-free
f. bandwidth 250 - 6000 Hz.
Acoustic processing and input
For acoustic processing we use a filter bank of 14 filters, the bandwidths of which correspond to the selectivity of the human ear. Output of the filters, number of zero crossings and total energy are sampled every 10 ms. In the input phase an
algorithm is applied for detection of beginning and ending of the speech signal from
zero crossings and total energy.
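Such an endpoint detector might be sketched as follows; the report does not specify the decision rule, so the thresholds and the simple per-frame test below are our own assumptions:

    import numpy as np

    def detect_endpoints(energy, zero_crossings, e_thresh, z_thresh):
        """Find beginning and end of speech in a sequence of 10 ms frames.

        A frame is taken as speech when its total energy exceeds e_thresh
        or its zero-crossing count exceeds z_thresh (catching weak fricatives).
        In practice the thresholds would be derived from the background noise.
        """
        is_speech = (np.asarray(energy) > e_thresh) | \
                    (np.asarray(zero_crossings) > z_thresh)
        idx = np.flatnonzero(is_speech)
        if idx.size == 0:
            return None
        return idx[0], idx[-1]        # first and last speech frame

    # Invented frame data:
    energy = [0.1, 0.2, 3.0, 5.0, 4.0, 0.3, 0.1]
    zcr    = [5,   6,   20,  30,  40,  8,   4]
    print(detect_endpoints(energy, zcr, e_thresh=1.0, z_thresh=15))   # (2, 4)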
Data reduction
Time normalisation is applied by dividing the total duration of the signal into an
equal number of segments such that the accumulated spectral change within each
segment is equal for all segments. The number of segments depends on the vocabulary
and varies from 6 - 10.
Further data reduction is obtained by reducing the spectral shape of each sample.
This is done by coding the derivative of the spectral envelope with one bit only.
The reduced spectral shapes of all samples within each time-normalisation segment
are then averaged. In this way the spectrum of each such segment is coded in 13 bits.
Finally the total energy and number of zero crossings are averaged per segment, and
the derivatives in the time dimension coded with one bit each. We thus obtain a total
of 15 bits per segment.
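Both reduction steps can be sketched in code. In the sketch below (our reading of the description, not the original implementation) segment boundaries equalise the accumulated spectral change, and each averaged 14-filter spectrum is coded with one sign bit per adjacent filter pair, i.e. 13 bits:

    import numpy as np

    def segment_boundaries(spectra, n_segments):
        """Frame indices splitting a word into n_segments with equal
        accumulated spectral change (spectra: n_frames x 14 filter outputs)."""
        change = np.abs(np.diff(spectra, axis=0)).sum(axis=1)
        cum = np.concatenate([[0.0], np.cumsum(change)])
        bounds = np.searchsorted(cum, np.linspace(0.0, cum[-1], n_segments + 1))
        bounds[-1] = len(spectra)          # include the final frame
        return bounds

    def code_segment(seg):
        """Average the spectra of one segment and code the derivative of the
        spectral envelope with one bit: 13 sign bits for 14 filters."""
        return (np.diff(seg.mean(axis=0)) > 0).astype(np.uint8)

    # Invented word of 40 frames (14 filters), reduced to 8 segments:
    rng = np.random.default_rng(0)
    spectra = rng.random((40, 14))
    b = segment_boundaries(spectra, 8)
    pattern = [code_segment(spectra[b[i]:max(b[i] + 1, b[i + 1])]) for i in range(8)]
    print(b, pattern[0])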
Training and classification
The system is trained with 20 male subjects who spoke each word three times. The
resulting 60 patterns for each word are reduced in number by the condensed nearest
neighbour method (Cover and Hart, 1967; Hart, 1968).
Actual recognition is achieved by nearest-neighbour classification. In each particular
instance of word recognition, classification is in terms of a limited vocabulary
as determined by the structure of the dialogue.
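The condensed nearest-neighbour rule keeps only those training patterns that are needed to classify the whole training set correctly. A compact sketch (in Python; the use of Hamming distance on the 15-bit patterns is our assumption, the report does not name the metric):

    import numpy as np

    def nearest_label(x, store_X, store_y):
        """Nearest-neighbour classification by Hamming distance."""
        d = [np.count_nonzero(x != s) for s in store_X]
        return store_y[int(np.argmin(d))]

    def condense(X, y):
        """Hart's (1968) condensed nearest-neighbour rule: start with one
        pattern and keep adding every training pattern the current store
        misclassifies, until a full pass adds nothing."""
        store_X, store_y = [X[0]], [y[0]]
        changed = True
        while changed:
            changed = False
            for xi, yi in zip(X, y):
                if nearest_label(xi, store_X, store_y) != yi:
                    store_X.append(xi)
                    store_y.append(yi)
                    changed = True
        return store_X, store_y

    # Invented 15-bit patterns for a two-word vocabulary, 60 per word pair:
    rng = np.random.default_rng(1)
    X = [rng.integers(0, 2, 15) for _ in range(60)]
    y = [i % 2 for i in range(60)]
    store_X, store_y = condense(X, y)
    print(len(store_X), "patterns kept out of", len(X))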
The vocabularies used in this system are:
- four direction words
- three part-of-the-day words
- nine day words
- twice nine number names,
and the smallest vocabulary: "ja" and "nee" (yes, no).
Classification leads to three different modes of recognition (except for the yes/no vocabulary):
(1) certain recognition of a word
(2) uncertain recognition of a word
(3) no recognition.
Linguistic form of the dialogue (Leopold, van Katwijk)
Due to its limited recognition power the system could hardly be provided with flexible conversational niceties such as the ability to react meaningfully to a user's comments or questions. It has to retain the initiative, imposing a rigid structure on the dialogue, and using carefully chosen questions such that the user's answer to each question is predictably one of a few (2-9) isolated words, forming the vocabulary appertaining to that question. Both the structure of the dialogue and the wording of the questions have been tried out experimentally.
The questions are directed at a specification of the three parameters needed by the system, viz. direction, day of the week, and hour of the day. This could of course be done by means of three questions only. For reasons of efficiency, and in order to break up the vocabularies of possible answers, more questions were used, even in the case of correct recognitions only.
Some of the properties of the dialogue will become clear when parts of the actual
conversation are discussed.
After a monologue in which the system introduces itself, indicating what it is and
does, and promising departure times of intercity trains from Eindhoven, it enumerates
the four cities related to the directions in which intercity trains are running.
This menu-type enumeration is concluded with "van toepassing voor u is ..." ("applicable for you is ..."), where the user has to fill in the information by uttering one of the city names.
Menu-type questions are not the only ones used in the dialogue.
The next step (after confirmation by the system of the recognised city name) is to
establish the day. This is done by first asking a yes/no-question about the most
likely day of departure, namely: "vandaag?" (today?). This very short question
focuses - by so-called conversational implication (cf. Bunt, this issue) - attention
on the day parameter. The user saying "nee" to "vandaag?" is asked: "welke andere
dag?" (which other day?).
The next step, leading to number names for the hours, has to be made in two, first by asking for the part of the day: "ochtend, middag, avond?" (morning, afternoon, evening?), after which the system asks: "welk uur tussen X en Y?" (which hour between X and Y?), where X and Y are in the twelve-hour system and where the number of possible answers is limited by the durations of mornings, afternoons and evenings. X and Y are 5 and 1 for mornings and evenings, 12 and 6 for afternoons.
On the basis of specifications of direction, day and hour, the system consults its data base and produces as a rule three departure times round the target hour. If the next question: "wilt u meer inlichtingen?" (do you want further information?) is answered with "nee", the system breaks off the dialogue with a cheerful farewell. If the answer is in the affirmative, for reasons of efficiency the system does not proceed immediately to the beginning of a new cycle. Instead it asks: "zelfde richting?" (same direction?) and - depending - "zelfde dag?" (same day?), and acts accordingly by entering the cycle at a point where new information is wanted.
As may be evident from the examples given, the system speaks in a somewhat elliptical
style. This turned out to be an improvement over previous, simulated versions where
more elaborate speech was found to be somewhat irritating.
A special comment must be made about uncertain recognition, wrong recognition and no recognition. With uncertain recognition (determined in the recognition process) the word to hand is presented to the user in a yes/no-question: "gaat het om X?" (does it concern X?). In the case of wrong recognition the system has unfortunately no facility to react to protests on the part of the user, who may have to be reconciled with a trip he never intended to make. If a word is not recognised in three successive trials, the system refers the user to the railway information office.
Voice output (De Jong, Willems)
The output speech used is digitised and stored real speech. Of course this has the disadvantage that essentially no rules can be applied to modify the speech waveform, and therefore it is difficult to obtain acceptable speech by concatenation of units like words, syllables or phonemes. Speech output from digitised real speech is largely constrained to prerecorded whole messages. In our case, due to the occurrence of variable items such as the days of the week and the hours of the day in the output sentences, this would have given a very long list of messages. We therefore opted for a compromise solution in which sentences were prerecorded as wholes, but variable parts of otherwise identical sentences were inserted by means of speech editing (cf. Willems and De Jong, 1974). This especially applied to the specification of hours and minutes. An example of an output message, with prerecorded parts indicated, may be: "(The first intercity train leaves at)+(seven)+(hours)+(twenty)". In order to avoid undesired discontinuities at the fragment boundaries (+'s), all items were spoken in the context in which they had to appear by a trained speaker who took care to keep the intonation of the frame sentence constant. All recordings were made in the same recording session. The messages resulting from assembling the prerecorded sentence fragments sounded perfectly natural.
System control structure (Muller)
The system control structure consists of 3 main parts, a maintenance programme, a
data base, and a control programme.
Maintenance programme
The maintenance programme is an interactive system for implementation and modification
of the data base. It has no further function in the operation of the system.
Data base
The data base contains the structure of the dialogue, the language material for voice
output, the vocabularies for the recogniser, a time table, a calendar, and a clock.
The structure of the dialogue
The structure of the dialogue consists of concatenated trees of questions (put by the system) and answers (given by the user). Each branch of a tree establishes a correspondence between a particular result of the recognition process and a sequel of the dialogue.
The control programme controls the interaction between the machine and the user: it selects a message for voice output and the accompanying vocabulary for the expected responses. Then it starts voice output and opens the voice input. Next, a following message is selected, as laid down in the dialogue structure, on the basis of the received (recognised) answer of the user.
If the information necessary to access the time-table is complete, three departure
times are chosen from it and passed on to the voice output system.
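The concatenated trees can be pictured as nodes that couple a prompt, a vocabulary and a branch table. The fragment below is purely illustrative (names, class layout and the piece of dialogue are ours, not the actual programme):

    # Sketch of a dialogue-tree node: each branch maps a recognition result
    # to a sequel of the dialogue, as described above.

    class Node:
        def __init__(self, prompt, vocabulary, branches):
            self.prompt = prompt          # message for voice output
            self.vocabulary = vocabulary  # words the recogniser may expect here
            self.branches = branches      # recognised word -> next Node (or None)

    # Illustrative fragment: the "today?" yes/no question.
    ask_other_day = Node("welke andere dag?",
                         ["maandag", "dinsdag", "woensdag"],   # etc.
                         {})                                   # branches omitted
    ask_today = Node("vandaag?",
                     ["ja", "nee"],
                     {"ja": None,               # day known, go on to the hour
                      "nee": ask_other_day})

    def step(node, recognised_word):
        """Control-programme step: pick the sequel laid down in the tree."""
        return node.branches.get(recognised_word)

    print(step(ask_today, "nee").prompt)        # -> "welke andere dag?"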
Some tentative conclusions
For the limited goals of this information system the kind of dialogue control structure we used is found to be sufficient to guide users successfully through the dialogue, in most cases where the word recogniser is not functioning too badly. Generally, users of the system have little difficulty in adapting their verbal responses to the requirements of the system. Even so, we feel that for more complex systems we cannot easily extrapolate from our experience with this rather limited ad hoc one, and we would greatly profit from more basic insight into the mechanisms of human dialogues.
Of course, more complex systems will naturally require a larger variety of output messages, and in that case the use of digitised real speech, perfectly satisfactory in the present system, may become impracticable as voice output. For other systems one may think of synthesis from prerecorded and analysed words, morphemes or diphones, depending on the requirements of the system. The acoustic form of such units can then be modified by rule in order to arrive at acceptable speech quality.
As might be expected, the weakest link of the present system is the word recogniser. Besides easily reparable shortcomings, such as the absence of the possibility to erase incorrect, yet certain, classifications, the following are more serious ones, specifically concerning the word recogniser: (1) For many of the speakers, performance is inadequate, to the extent even that they do not always get the desired information. (2) It is not easy to find out why recognition fails. For future work in this line it seems advisable to take a rather different approach, enabling researchers to follow more closely the behaviour of the recogniser in response to different acoustic signals.
Notwithstanding the shortcomings of this, our first attempt to let a computer speak and understand speech, we have shown to our satisfaction that, in principle, it is possible to overcome the difficulties stemming from a very limited and not error-free word recogniser, by using a dialogue control structure that lays constraints on the verbal behaviour of the human participant in a man-machine dialogue.
References
Cover, T.M. and Hart, P.E. (1967) Nearest-neighbour pattern classification, IEEE Trans. IT-13, no. 1, January 1967, p. 21-27.
Flanagan, J.L. (1976) Computers that talk and listen: Man-machine communication by voice, Proceedings of the IEEE, 64, no. 4, April 1976, p. 405-415.
Hart, P.E. (1968) The condensed nearest-neighbour rule, IEEE Trans. IT-14, May 1968, p. 515-516.
Klatt, D.H. (1977) A review of the ARPA Speech Understanding Project, Expanded preprint version of a paper to be published in the Reviews section of the Journal of the Acoustical Society of America.
Willems, L.F. and De Jong, Th.A. (1974) Research tools for speech perception studies, I.P.O. Annual Progress Report, 9, p. 77-81.
The Formator: a speech analysis-synthesis system based on formant extraction from linear prediction coefficients
L.L.M. Vogten and L.F. Willems
Introduction
It is well known in speech that the value of the waveform at a given instant is closely correlated with its values at previous instants, and hence represents redundant information (Flanagan, 1972). Among the many models describing the speech signal more efficiently, the production model based on the linear predictability of the speech wave has been quite successful (e.g. Fant, 1960; Atal and Hanauer, 1971; Sambur, 1975). This Linear Predictive Coding (LPC) of speech represents the waveform in terms of relatively slowly varying parameters which are related to the transfer function of the vocal tract and to the characteristics of the speech source. The LPC analysis is in fact the calculation of an Mth-order digital filter, the coefficients of which are determined by minimising the mean squared error between the actual input sample and an Mth-order linear prediction of the input sample. From these M coefficients the speech wave can be resynthesised as the output of the inverse filter with the same M coefficients, excited by pulses or by noise (Markel and Gray, 1976).
The LPC method has the advantage that only relatively short segments of the speech wave are analysed in the time domain. No Fourier transform is performed and analysis can be rather fast. Unfortunately the M filter coefficients are less suitable for further processing, because small errors in the coefficients can result in large errors or even instability of the inverse filter used for the synthesis.
On the other hand, we know that a description of the speech wave in terms of natural frequencies of the vocal tract, or formants, is a very efficient one. Formants also change relatively slowly with time (Flanagan, 1970). Hence, if we are able to determine the formants from the M filter coefficients, the LPC analysis cuts both ways. The present contribution describes such an analysis-synthesis system based on formant extraction from the linear prediction coefficients. The system determines 5 formants from a 10th-order LPC analysis. This "Formator" has been developed at our institute and provides a powerful tool in phonetic research, because formant (and also pitch) trajectories can be isolated, varied, stylised or quantised, and the effect of these manipulations on the perception of speech can be studied. The "Formator" may also prove useful in application fields such as voice response units, low-bit-rate vocoders, speech recognition, etc.
The analysis part of the "Formator" has been implemented in software on our P9202
computer. The synthesis part is a digital hardware synthesiser. First we give a
general description of the analysis part, followed by details of the LPC analysis,
the formant extraction and the pitch extraction, whereupon the hardware synthesis part is briefly described. The second part of this contribution gives some examples
of the practical use of the system for bit rate reduction and for its use as
"Intonator" (Willems, 1966) in phonetic research.
The 'Formator'
A block diagram of the system is shown in Fig. 1. The original speech is digitised at a 10 kHz sample frequency, 8 bits per sample, and stored on disc with the Speech Editing System (Willems and de Jong, 1974). Then an LPC analysis program is run, yielding the coefficients of a 10th-order digital filter and the amplitude parameter. From these 10 coefficients 5 second-order filters are calculated (each with 2 coefficients). The pitch period and the voiced/unvoiced parameter are determined in a separate program. These 13 parameters are then fed to the digital hardware synthesiser (Rockland 4512) and the remade speech is available. Further details of the system are described in the following sections.
Fig. 1. Block diagram of the "Formator".
The LPC analysis
From the stored speech a 25 msec segment (250 samples) is triangularly windowed and pre-emphasised by a first-order filter 1 - μz^-1 with μ = 0.90. Then the 10 filter coefficients are determined by solving a set of 10 simultaneous equations which results from a least-squares criterion for the error between the actual and the predicted input sample of the speech wave. In fact the autocorrelation method is used (Makhoul, 1975; Markel and Gray, 1976). After some further calculations, which will be described in the next section, the analysis window is shifted over 100 samples to the next 25 msec speech segment. Thus, every 10 msec the amplitude and the 10 filter coefficients are updated. This frame rate of 100 Hz is a suitable value for normal speech; in many cases steps of 20 or 30 msec also give good results.
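For the reader unfamiliar with the autocorrelation method, the analysis of one frame can be sketched as follows (a textbook Levinson-Durbin version in Python, not the institute's program; the test signal is invented):

    import numpy as np

    def lpc_frame(samples, order=10, mu=0.90):
        """10th-order LPC of one 25 ms frame by the autocorrelation method."""
        x = np.asarray(samples, dtype=float)
        x = np.append(x[0], x[1:] - mu * x[:-1])    # pre-emphasis 1 - mu*z^-1
        x = x * np.bartlett(len(x))                 # triangular window
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]

        # Levinson-Durbin recursion for the normal equations.
        a = np.zeros(order)
        err = r[0]
        for i in range(order):
            k = -(r[i + 1] + np.dot(a[:i], r[i:0:-1])) / err
            a[:i + 1] = np.append(a[:i] + k * a[:i][::-1], k)
            err *= (1 - k * k)
        return a, err    # a_k of A(z) = 1 + sum a_k z^-k, and residual energy

    # Invented frame: 250 samples of a decaying 700 Hz resonance, fs = 10 kHz.
    t = np.arange(250) / 10_000
    frame = np.exp(-300 * t) * np.sin(2 * np.pi * 700 * t)
    a, err = lpc_frame(frame)
    print(np.round(a, 3))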
Fig. 2. Time (left panel) and frequency (right panel) representation of a 25 msec speech segment (one frame). The upper curve in each panel concerns the unwindowed signal, the lower curve is the triangularly windowed and pre-emphasised signal. In the middle of the right panel is shown the spectrum that results from the LPC analysis.

An example of the analysis result of one frame is shown in Fig. 2. The 25 msec speech segment, shown at the top of the left panel, is windowed, pre-emphasised and then plotted at the bottom of the left panel. For display purposes the corresponding fast Fourier transforms of the two time signals are plotted in the right panel. This FFT is not used in the LPC calculations. The 10 predictor coefficients, which in fact represent the impulse response of the digital filter, are calculated, and the corresponding FFT spectrum (also calculated for display purposes only) is shown in the middle of the right panel. It illustrates how the spectral envelope of the filter fits that of the (lower) speech wave. For this voiced speech segment (the English vowel /ɔ/ of the word "call") the 5 formants are easily discernible.
Formant extraction from the LPC data
The digital filter determined with the LPC analysis program is characterised by 10 filter coefficients {a_k} and can be represented in the z-domain by

    A(z) = 1 + Σ(k=1..10) a_k·z^-k                  (1)

The polynomial (1) can also be written as a product of 5 quadratic terms:

    A(z) = Π(i=1..5) (1 + p_i·z^-1 + q_i·z^-2)      (2)
Calculation of the coefficients {p_i,q_i} from the coefficients {a_k} can be done numerically. Then we have a set of 5 {p_i,q_i} combinations representing a cascade of 5 digital second-order filters equivalent to the 10th-order filter. These 5 second-order filters can now be conceived as the 5 formants that we are looking for.
However, we are still left with two problems: (a) the pairs {p_i,q_i} resulting from the calculations are not naturally ordered on a frequency scale, while the formants definitely are, and (b) it is possible that, especially for consonants or fricatives, one or more of the pairs {p_i,q_i} represent a filter whose poles are real. In that case we cannot speak of a formant having a tuning frequency and a bandwidth.
These problems are solved by the application of a transformation procedure to the {p_i,q_i} pairs, so that ordering on a frequency scale becomes easy. After that ordering procedure, p and q of each pair are limited to such values that they always correspond to complex pole pairs. We shall not go into further details here but only remark that these changes have no audible effect upon the spectrum of the ultimately resynthesised speech segment. This is illustrated in Figs. 3a, b and c, where examples are shown of spectra in which only 4 peaks are present. In Fig. 3a (the upper curve) the second formant is missing in the spectral envelope, owing to a real p2,q2 pair. If we force this pair to values corresponding to a complex pole pair with a large bandwidth, the lower spectrum in Fig. 3a results. No difference can be discerned between the upper and the lower curves.
Another example is shown in Fig. 3b, where the third formant was missing and the higher formants have a large bandwidth, so only 3 peaks are discernible in the spectrum. Making the p3,q3 pair complex causes the lower curve of Fig. 3b to be somewhat steeper in the region above 4 kHz than the original spectrum (upper curve). Comparable results are shown in Fig. 3c, where the fifth formant was missing.
Fig. 3. The spectrum of the digital filter resulting from the LPC analysis before (upper curves) and after (lower curves) the formant extraction. In panel a the second formant is "missing", in panel b the third and in panel c the fifth formant (upper curves). They are "artificially" added by changing the second-order filter coefficients so that the real poles become complex, yielding a formant with a large bandwidth (lower curves).
These examples illustrate that the errors introduced by the "forced complex making" procedure have little or no effect upon the spectral envelope of the resulting digital filter. The result is that we now always have 5 and only 5 formants, and after the definite assignment of numbers 1 to 5 inclusive, the complex and ordered {p_i,q_i} pairs are used to determine the input parameters for the digital hardware device in order to synthesise the speech wave.
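Leaving aside the ordering transformation and the treatment of real poles, the kernel of the extraction is the factoring of A(z) and the conversion of each complex pole pair into a frequency and a bandwidth. A sketch, using the standard textbook conversion (the report does not state the formulas explicitly):

    import numpy as np

    def formants_from_lpc(a, fs=10_000):
        """Factor A(z) = 1 + sum a_k z^-k and convert each complex pole pair
        of the synthesis filter 1/A(z) into a formant frequency and bandwidth."""
        roots = np.roots(np.concatenate(([1.0], a)))
        roots = roots[np.imag(roots) > 0]         # one of each conjugate pair;
                                                  # real poles would need the
                                                  # "forced complex making" step
        freq = np.angle(roots) * fs / (2 * np.pi)   # pole angle  -> frequency (Hz)
        bw = -np.log(np.abs(roots)) * fs / np.pi    # pole radius -> bandwidth (Hz)
        order = np.argsort(freq)                    # order on a frequency scale
        return freq[order], bw[order]

    # Self-check with an invented A(z) built from two known resonances.
    # Each second-order pair is p = -2r*cos(theta), q = r^2.
    fs, test = 10_000, [(700, 100), (1800, 150)]    # (frequency, bandwidth) in Hz
    poly = np.array([1.0])
    for f, b in test:
        r, theta = np.exp(-np.pi * b / fs), 2 * np.pi * f / fs
        poly = np.polymul(poly, [1.0, -2 * r * np.cos(theta), r * r])
    F, B = formants_from_lpc(poly[1:], fs)
    print(np.round(F), np.round(B))                 # ~[700 1800] [100 150]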
Pitch extraction
The pitch period or fundamental frequency and the voiced/unvoiced decision are determined every 10 msec from a speech segment of 35 msec. This segment is long enough to ensure that at least 2 pitch periods are present in the waveform. For the pitch extraction we use a modified version of Sondhi's (1968) method. First the spectrum of the speech segment is flattened by a dynamic centre-clipping procedure and then the "auto-sign-correlation function" (Rabiner, 1977: method 6) is calculated. One of the maxima of this function is taken as the pitch period, provided it is positioned within a specific interval. Position and width of this interval depend on position and magnitude of the previous maximum. A high magnitude of the previous peak implies a salient pitch period, and in that case the interval within which the new peak has to be found is narrow. A low magnitude, on the other hand, goes with a large possible interval for the new pitch period. Proper choice of the boundaries can avoid pitch doubling or octave jumps in the measured pitch. This method of variable window width not only saves calculation time but also takes into account a certain continuity that is always present in natural pitch contours.
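A much simplified rendering of the extractor is sketched below; the clipping level, the 50-500 Hz bounds and the fixed ±30% continuity interval are our own placeholders for the unspecified details:

    import numpy as np

    def pitch_period(segment, prev_period=None, fs=10_000):
        """Estimate one pitch period (in samples) from a 35 ms segment."""
        x = np.asarray(segment, dtype=float)
        c = 0.3 * np.max(np.abs(x))                           # dynamic clipping level
        s = np.where(x > c, 1.0, np.where(x < -c, -1.0, 0.0))   # centre clipping
        acf = np.correlate(s, s, mode="full")[len(s) - 1:]      # auto-sign-correlation
        lo, hi = int(fs / 500), int(fs / 50)                  # 50-500 Hz search range
        if prev_period is not None:                           # continuity constraint:
            lo = max(lo, int(0.7 * prev_period))              # a narrower interval
            hi = min(hi, int(1.3 * prev_period))              # around the last period
        return lo + int(np.argmax(acf[lo:hi]))

    # Invented voiced segment: 100 Hz pulse train through a simple decay.
    x = np.zeros(350)
    x[::100] = 1.0
    x = np.convolve(x, np.exp(-np.arange(80) / 20.0))[:350]
    print(pitch_period(x, prev_period=95))                    # ~100 samples (100 Hz)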
Hardware synthesis
The speech wave can now be resynthesised by a digital hardware synthesiser. This device consists of a cascade of 5 second-order digital filters excited by a quasi-periodic pulse (voiced sound) or by noise (unvoiced speech). It needs the amplitude, pitch period, voiced/unvoiced and formant parameters at every pitch period. Since the parameters in the analysis are calculated at 10 msec intervals, an interpolation is necessary corresponding to the actual pitch periods. These interpolated parameter values are then used as the input parameters for the synthesiser (Rockland 4512).
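The synthesiser's sections can be written as difference equations; the conversion from formant frequency F and bandwidth B back to the coefficients p and q is the inverse of the extraction step above. Again a schematic sketch, not the Rockland 4512:

    import numpy as np

    def resonator_coeffs(F, B, fs=10_000):
        # Second-order section 1 / (1 + p*z^-1 + q*z^-2):
        # p = -2r*cos(theta), q = r^2, with r = exp(-pi*B/fs), theta = 2*pi*F/fs.
        r = np.exp(-np.pi * B / fs)
        theta = 2 * np.pi * F / fs
        return -2 * r * np.cos(theta), r * r

    def run_section(x, p, q):
        """Difference equation y[n] = x[n] - p*y[n-1] - q*y[n-2]."""
        y = np.zeros_like(x)
        for n in range(len(x)):
            y[n] = x[n] - p * (y[n - 1] if n >= 1 else 0.0) \
                        - q * (y[n - 2] if n >= 2 else 0.0)
        return y

    def synthesise(formants, pitch_period, n, fs=10_000):
        excitation = np.zeros(n)
        excitation[::pitch_period] = 1.0          # quasi-periodic voiced source
        y = excitation
        for F, B in formants:                     # cascade of 5 sections
            y = run_section(y, *resonator_coeffs(F, B, fs))
        return y

    # Invented /a/-like formant set, 100 Hz pitch, 0.2 s of signal:
    formants = [(730, 90), (1090, 110), (2440, 160), (3500, 200), (4500, 250)]
    y = synthesise(formants, pitch_period=100, n=2000)
    print(np.max(np.abs(y)))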
Practical use of the 'Formator'
This section contains a brief description of some possibilities and results of informal experiments with the "Formator". As details of the system are still being improved, "objective" test results are not yet presented.
Up to now a direct comparison between the stored input speech and the resynthesised version has been performed for about 10 different speakers (male and female). Several seconds of normal speech (Dutch and English mainly, different sentences for different speakers) were analysed. The raw analysis results were smoothed with both a running median smoothing over 5 frames and a linear filtering over 3 frames (Rabiner et al., 1975). Examples of the raw and the smoothed data are shown in Figs. 4a and 4b for the sentence "I don't think it's necessary to call in the doctor". Amplitude, pitch, voiced/unvoiced parameter and formant frequencies and bandwidths are shown for a segment of 2 sec, almost the complete sentence. In the experiments the input speech was immediately followed by the two resynthesised versions, from the raw and the smoothed data. Although, of course, slight differences were audible between the original and the raw or smoothed resynthesised speech, the quality of the remade speech was good.
Bit rate reduction with the 'Formator'
Once we have a description of the spectral envelope of the speech wave in terms of ordered {p_i,q_i} pairs related to formants, it is easy to quantise these parameters and hence reduce the bit rate.
Fig. 4. The 13 parameters calculated in the analysis part of the system plotted as a function of time. Upper panel: amplitude, unvoiced marks and pitch contour. Panel (a): raw data from the analysis; the length of the vertical bars is the formant bandwidth (in Hz) divided by 2, so as not to overload the figure. The formant frequency is in the middle of each bar. Short bars indicate a narrow bandwidth and thus a sharp and high peak in the spectrum. Panel (b): smoothed formant data. Now the formant frequencies of the 10 msec frames are interconnected in order to show the formant tracks. Panel (c): the same sentence analysed at 40 msec steps and then quantised with 28 bits per frame, resulting in a bit rate of 700 bits/sec and still acceptable in quality.
In our case the digitised input speech needs 80 kbits/sec (10 kHz sample frequency, 8 bits per sample). The analysed speech can be described with about 14 kbits/sec: pitch 8, amplitude 8, voiced/unvoiced 1, each formant frequency 12 and formant bandwidth 12, making 137 bits per (10 msec) frame. Preliminary experiments with quantisation of the parameters down to 28 bits per frame (amplitude 3, pitch 6, voiced/unvoiced 1, F1 up to F5 with respectively 3, 4, 3, 0, 0 bits and B1 up to B5 with respectively 2, 2, 2, 1 and 1 bits) turned out to have almost no audible effect upon the resynthesised speech compared with the unquantised version. Now we have a bit rate of 2800 bits/sec. Still
further reduction of information content with little loss of quality can be achieved
by stylisation of the formant trajectories with an approximation by straight lines.
Another possibility is to increase the analysis step width from 10 msec to 30 or
40 msec. This not only considerably reduces the frame rate and hence the bit rate
but also the calculation time. In Fig. 4c an example is shown of the same sentence
as in Figs. 4a and b, but now analysed with frame steps of 40 msec and then quantised
with the same number of bits as mentioned above. This resulted in a description of
the speech with 700 bits/sec and still acceptable in quality.
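The quoted bit rates follow directly from these allocations and the frame rate; a small check in Python:

# Bit allocations as given in the text.
full_frame = 8 + 8 + 1 + 5 * (12 + 12)                                 # 137 bits per 10 msec frame
quant_frame = 3 + 6 + 1 + (3 + 4 + 3 + 0 + 0) + (2 + 2 + 2 + 1 + 1)    # 28 bits per frame

print(full_frame * 100)          # 13700 bits/sec, about 14 kbits/sec
print(quant_frame * 100)         # 2800 bits/sec at 10 msec frames
print(quant_frame * 1000 // 40)  # 700 bits/sec at 40 msec frames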
The 'Formator' as 'Intonator'
Another interesting feature of the analysis-synthesis system is the possibility
of stylising the pitch contours. The natural pitch contour, measured in the ana
lysis part of the system, can easily be replaced by a stylised intonation contour
of arbitrary shape. Thus the experimenter immediately obtains an impression as to
which pitch movements are relevant to the overall intonation pattern (Collier and
't Hart, 1975; 't Hart and Cohen, 1973) and which are not, simply by comparing the
speech with the measured pitch contour to an artificial stylised version and
then judging whether they are perceptually equivalent or not.
An example of two perceptually equivalent intonation patterns is given by 't Hart
(1977). One advantage of the present system compared with previous "Intonators"
(Willems, 1966) is the better quality of the remade speech.
Summary
We presented the "Formator", a speech analysis-synthesis system based on a
Linear Prediction Coding of the speech wave followed by a formant extraction pro
cedure. At present the analysis is still performed in software and a 5-formant
analysis with a frame rate of 100 Hz takes about 30 times real time. Pitch is
measured in a separate program, taking about 20 times real time. This "Formator"
looks like becoming a promising system, not only for phonetic research but also
in the field of low-bit-rate vocoders, voice response units and speech recognition.
References
Atal, B.S. and Hanauer, S.L. (1971) Speech analysis and synthesis by linear prediction of the speech wave, J. Acoust. Soc. Am., 50, p. 637-655.
Collier, R. and 't Hart, J. (1971) A grammar of pitch movements in Dutch intonation, I.P.O. Annual Progress Report, 6, p. 17-21.
Fant, G.C.M. (1960) Acoustic theory of speech production, Mouton & Co, 's-Gravenhage, The Netherlands.
Flanagan, J.L. (1970) Synthetic voices for computers, I.E.E.E. Spectrum, October 1970.
Flanagan, J.L. (1972) Speech analysis, synthesis and perception, Springer, Berlin.
Makhoul, J. (1975) Linear prediction: a tutorial review, Proc. I.E.E.E., 63, p. 561-580.
Markel, J.D. and Gray, A.H. (1976) Linear prediction of speech, Springer, Berlin.
't Hart, J. and Cohen, A. (1973) Intonation by rule: a perceptual quest, J. Phonetics, 1, p. 309-327.
't Hart, J. (1977) Pitch contour stylisation on a high-quality analysis-resynthesis system, this issue.
Rabiner, L.R. (1977) On the use of autocorrelation analysis for pitch detection, I.E.E.E. Trans. ASSP-25, p. 24-33.
Rabiner, L.R., Sambur, M.R. and Schmidt, C.E. (1975) Applications of a nonlinear smoothing algorithm to speech processing, I.E.E.E. Trans. ASSP-23, p. 552-557.
Sambur, M.R. (1975) An efficient linear prediction vocoder, Bell Syst. Techn. Journ., 54, p. 1693-1723.
Sondhi, M.M. (1968) New methods of pitch extraction, I.E.E.E. Trans. AU-16, p. 262-266.
Willems, L.F. (1966) The Intonator, I.P.O. Annual Progress Report, 1, p. 123-125.
Willems, L.F. and de Jong, Th.A. (1974) Research tools for speech perception studies, I.P.O. Annual Progress Report, 9, p. 77-81.
Pitch contour stylisation on a high-quality analysis-resynthesis system
J. 't Hart
Introduction
As has been reported in earlier issues of this report, as far back as the very first
one ('t Hart, 1966) intonation studies have fruitfully been based on the possibility
of making stylised, artificial pitch contours as perceptual equivalents to original
Fo courses by means of the Intonator (Willems, 1966). This Channel-Vocoder based
instrument can easily and quickly be manipulated with respect to location, slope and
duration of changes of fundamental frequency. The artificial pitch contour is dis
played on a large-screen oscilloscope. Within two successive revolutions of the
tape loop with the input message, the selection of some preset changes can be altered
leaving desired reference points of the contour unchanged, thus providing a facility
for direct comparison of two contours.
The apparent possibilities for stylisation have been made plausible by providing some
psycho-acoustic backgrounds for them ('t Hart, 1976): the differential sensitivity
to size, location and slope of frequency changes is rather limited. In the same paper,
however, it is admitted that our usually far-reaching stylisations with standard move
ments often go beyond the threshold of sensitivity estimated there.
A main drawback of the Intonator is that its output signal has poor quality. The
question which then arises is whether this could not be a reason for the larger toler
ances than can be explained on psycho-acoustic grounds. The listener might be ham
pered by this poor quality and thus be unable to concentrate adequately on the pitch
phenomena. Moreover, since a number of the psycho-acoustic background experiments
have been done on the basis of Intonator-processed stimulus material, their results
may be affected by the same lack of quality as well.
Recently an LPC-based analysis-resynthesis system has become available, and one of
its applications is, understandably, that of a high-quality Intonator (Vogten and
Willems, this issue). This provides a direct opportunity for examining the possible
influence of poor quality.
Stylisation on the LPC system
This is a report of our first attempt to obtain a maximally stylised contour for a
Dutch sentence by means of the new system, and to compare it with what had appeared
to be possible with the old Intonator. The sentence is: "Ik geloof aan het volmaakte van al het gebeuren" ("I believe that all that happens is perfect", Helene Kröller-Müller's epitaph), as spoken by a professional broadcaster.
Fig. 1 shows the progressive stylisation applied in the earlier experiment with
the Intonator. The dots indicate the outcome of an objective measurement of Fo,
carried out later, together with the stylisation on the new system. The scattered
dots in the final part originate from erroneous measurements due to low amplitude
of the last syllable. In view of our present aim, no attempts have been made to obtain correct measurements for that syllable. The upper solid line (indicated by 1)
is an approximation in which the syllables "-loof", "vol-", "-maak-", "-te", "van",
"al" , "het", "beu-", if made audible in isolation, all sound at the same pitch as in
the respective isolated syllables of the original. In successive stylisations, the
dent in "vol-" (at position 2), the downstep in "-te" (at position 3) and the over
shoot in "al" (at position 4) have been smoothed out. This is not audible when
listening to the contour as a whole. Line 5 represents "intonation by rule": although this is a completely acceptable contour as such, it is clearly distinguishable from both the original and the other stylised contours, mainly because the peaks are experienced as too low; the standard excursions of 4 semitones are smaller than the
actual ones. This can easily be corrected by local adaptation of the excursions,
but almost equally well by raising the entire contour 2 or 3 semitones (not drawn in
the figure).
[Fig. 1: frequency (Hz, 50-200) as a function of time (0-3 sec), with the syllables of "Ik geloof aan het volmaakte van al het gebeuren" marked along the time axis.]
Fig. 1. The various stylised pitch contours as applied in the earlier experiment with the Intonator. Line 5 is "intonation-by-rule"; that contour can easily be distinguished from the original. Dots indicate Fo measurements.
[Fig. 2: frequency (Hz) as a function of time (0-3 sec) for the same sentence.]
Fig. 2. The pitch contours used with the LPC-system. Same sentence, same Fo measurements (dots) as in Fig. 1.
Fig. 2 gives the situation with the new system. Again, the dots indicate the Fo measurement. Line 1 corresponds roughly to line 1 of Fig. 1, except for the downstep in "-te", which is present in line 2, in which the gradual fall after the first peak has also been replaced by a more rapid one. Line 3 can be characterised as "intonation by rule", with locally adapted excursions. Again, the dent, the downstep and the overshoot have been removed.
In direct comparisons of each of the four stylised versions with the original (3 trained listeners), we could only hear differences if we knew beforehand which of the stylised versions was going to be made audible. These differences were: too high a pitch in "aan het" with the gradual fall of line 1, and the absence of a difference in pitch between "al" and "het" with the omission of the overshoot in line 3. The knowledge as to which contour is presented facilitates concentration on those particular portions where differences are expected according to the visual deviations. And yet we were
unable to hear any of the other clearly visible differences.
Conclusion
First of all, we may say in general that the new system provides us with a splendid,
precise and highly reliable Intonator, more versatile than the old one, and with a
much better quality. With respect to the particular problem put forward above, there seems to be no reason to fear that a high-quality analysis-resynthesis system will impose more restrictions on the degree of possible stylisation than a poor-quality system.
References
't Hart, J. (1966) Perceptual analysis of Dutch intonation features, I.P.O. Annual Progress Report, 1, p. 47-51.
't Hart, J. (1976) Psychoacoustic backgrounds of pitch contour stylisation, I.P.O. Annual Progress Report, 11, p. 11-19.
Vogten, L.M. and Willems, L.F. (1977) The Formator: a speech analysis-synthesis system based on formant extraction from linear prediction coefficients, this issue.
Willems, L.F. (1966) The Intonator, I.P.O. Annual Progress Report, 1, p. 123-125.
Auditory feedback as a factor in disrupted speech production
A.F.V. van Katwijk
Introduction
Does the auditory perception of one's own voice interact with the ongoing speech production?
Auditory feedback of speech has been extensively discussed mainly in connection with
stuttering and the effects on speech of delayed auditory feedback (DAF). Cherry
and Sayers (1956) found that stutterers stop stuttering if they are prevented from
hearing their own voices. Fairbanks and Guttman (1958) analysed the articulatory
errors of subjects who heard their own voices with delays of 0, 0.1, 0.2, 0.4 and
0.8 sec. Disturbance was maximal when the delay was 0.2 sec. Prominent disturbances
occurred most frequently in stressed syllables, and concerned lengthenings and
additions. Repetitions i.e. double articulations made up 70% of the additions, and
were considered the most characteristic feature of the disturbances. They were
interpreted in terms of a temporary restitution of the normal feedback relationships.
Our own observations and assumptions lead us to view repetitions rather as the in
voluntary result of recycling of speech material in articulation.
On the general problem of what one does with the auditorily perceived own speech,
Lane and Tranel (1971) argue in a well-documented paper that the hypothesis of a
direct sidetone control loop lacks adequate empirical support. Instead they suggest
that communicative demands largely determine what one does with one's speech. With
respect to stuttering they go on to suggest that "the experimental findings with
stutterers and normals lead to the same conclusion: the less the sidetone monitoring,
the more normal speech is possible." (p. 699).
The question can be raised whether sidetone monitoring has anything to do with
stuttering. Bloodstein (1975, p. 264) discusses the hypothesis of feedback control,
and doubts its relevance in view of the many facts that are not directly plausible
under it. For instance the fact that "stuttering tends to occur at the moments of
initiation of speech units where auditory feedback is absent".
The literature has many other indications that stand in the way of a simple feed
back hypothesis on stuttering.
Observations and considerations
This contribution deals with some observations on "DAF speech" as recorded during
an experiment with 12 subjects. The material furnishes some pertinent details as
to how auditory feedback signals may run through the speech production lines.
As regards these production lines, we must assume, even without DAF, that there is
a time lag of 0.1 sec or more between the execution of an articulatory programme and the acoustic result. With DAF, a delay of say 200 ms implies a temporal lag
between execution and perceived auditory signal of 300 ms or more. Note that
Fairbanks and Guttman (op.cit.) have found that the delay duration is positively correlated with the number of speech sounds involved in repetitions under DAF. This
span can be related to the number of speech sounds that have been produced before
the feedback loop is closed.
In stuttering and DAF disruptions, these considerations imply that there is a time
lag between production and perception at the basis of a limited number of specific
speech disruptions. Stuttering would in this view - as far as feedback information
could at all play a role - include repetitions of single short segments, of which
the durations agree with the presumed internal production span of at least 0.1 sec.
Morton (1968) speculates that non-stuttering is made possible by inactive periods
in the units that produce motor command patterns.
According to this view we may speculate as to what happens if the ongoing speech
production process is "hit" by the delayed sounds it has recently produced. A
first effect might be in the programming level, where the planned motor pattern is
run into by a competing pattern derived from the external feedback loop. The more
the patterns are different the less likely it is that an effective motor command
pattern will emerge. On the other hand, if the internal and external patterns are
similar, they might both be accommodated in the programme and executed one after the
other. This would lead to repetitions.
The material that has been analysed for this discussion was collected by Riet Dekker,
psychology student from Utrecht, who performed the DAF experiments as a part of
her training programme. I have concentrated on repetitions and lengthenings and on
the syllables in which these take place. We had 12 subjects repeat three-syllable
words in a normal feedback condition and in a DAF condition. The stimulus words
had been recorded in synchrony with a click pattern which had 214 ms interclick
intervals. The delay time of the feedback under DAF was also 214 ms, to ensure approximately stable phase relationships between syllable durations and delay in
tervals. Precise durations of stimulus and response words and quantitative analyses
of the experimental variables and their effects will be described in due course
elsewhere. In this account we will mainly look into phonetic details of recorded
performances. Listening to repetitions, elongations, pauses and other deviations
an independent judge (phonetically trained) and myself indicated the location of
these deviations for six subjects. Counting only agreed instances, we found that
there were deviations in 2 out of 336 first syllables, in 11 out of 336 second
syllables, in 70 out of 336 both second and third syllables and in 158 out of 336
third syllables. This may be interpreted to show that auditory feedback does affect
speech production under DAF, but only after at least 300 ms have elapsed of speech
produced under the planned programme.
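A rough timing sketch (our own back-of-the-envelope check, in Python, assuming syllables of about 214 ms and an internal execution-to-sound lag of 100 ms) makes the same point:

# Which syllables can still be reached by the delayed feedback?
syllable = 0.214        # s, approximate syllable (and interclick) duration
daf_delay = 0.214       # s, delay of the fed-back signal
internal_lag = 0.100    # s, minimum lag between programme execution and sound

loop_closes = internal_lag + daf_delay      # about 0.314 s after speech onset
for n in range(1, 4):
    onset = (n - 1) * syllable
    relation = "after" if onset >= loop_closes else "before"
    print(f"syllable {n} starts at {onset*1000:.0f} ms, {relation} the loop closes ({loop_closes*1000:.0f} ms)")
# Only the third syllable (onset about 428 ms) begins after the loop has closed,
# in line with the deviation counts reported above.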
Listening again to all twelve subjects I found the qualities of the disruptions of
the second and third syllables characteristically different and revealing:
- Lengthening occurs most frequently and with the largest increments in third syl
lables. There were also duration increments in the second vowel and the third consonant.
- Repetitions were either whole syllables, or vocalic parts of syllables. Repeated
syllables occurred almost exclusively when the target words had identical or
homorganic consonants and identical vowels (Tables I and II). The 18 target words
with heterorganic consonants (but identical vowels) gave rise only to two ambiguous
production errors: the target word pitiki (Table III) might in the first place
have been perceived as a word with two or three identical syllables.
Vowel repetitions are classified in Tables IV and V. The insertion of a vowel in the
third syllable has not occurred often, and when it did, there does not appear to be
a distributional condition with respect to place or manner in the articulation of
the consonants, which would favour the error.
target    production
papapa    papapapa (3x)
tatata    tatatata
bababa    babababa (?x)
dadada    dadadada
gagaga    gagagaga
mamama    mamamama (4x)
nanana    nananana (2x)

Table I. Production errors. Targets with same consonants, same vowels (8 stimuli and 12 subjects).

target    production
pubumu    pubumum/ud w
tudunu    tudu/nunu
tudunu    tudu/nunu
mupubu    mupububu
mapaba    mapab/aba
mipibi    mipibibi
mabapa    mabapapa (2x)
nutudu    nutududu (2x)
natada    natadada
nitidi    nitididi

Table II. Production errors. Targets with homorganic consonants, same vowels (18 stimuli and 12 subjects).
target    production
pitiki    pikikiki
(same)    pipipipi

Table III. Production errors. Targets with heterorganic consonants, same vowels (18 stimuli and 12 subjects).
Vowel repetitions occurred in the instances given in Table IV and V.
target    production
pabumi    pabumui
dinatu    dinatau

Table IV. Production errors. Targets with homorganic consonants, different vowels (6 stimuli and 12 subjects).

target    production
patuki    patukui
badugi    badugui
digabu    digabau

Table V. Production errors. Targets with heterorganic consonants, different vowels (6 stimuli and 12 subjects).
Discussion of lengthenings
The lengthening of the third syllables is the main effect of DAF on the performances
of our subjects. It occurred in well over 50% of the productions. As regards the
mechanism of lengthening at work here, could it be a recycling of vowel information
from auditory feedback into the ongoing vowel production? This would be impossible
in view of the temporal limits of the events: on the assumption that a DAF delay
of 200 ms implies a real delay between programme execution and perceived signal of
at least 300 ms, a regular vowel should be finished long before its beginning has
become available via the feedback loop. This applies a fortiori to consonants.
It would seem more realistic to interpret lengthenings as the results of a slowing
down process arising from incompatibility of internal and external signals. The
lengthenings often make the impression of uncontrolled continuation, a kind of free
wheeling.
Discussion of repetitions
The occurrence of syllable repetitions was limited to words where the syllables were
similar. Similarity here means: same place of articulation of consonants, same
vowels (Tables I and II).
The distribution of production errors in the tables cannot be taken to imply more
than an indication that place of articulation and sameness of vowels are possible
conditions for the occurrence of syllable repetitions. What seems to happen is that
the feedback of a compatible syllable is accommodated in the production programme
that is being executed, and then inserted in it. The occurrence of vowel perseverations indicates that if the feedback vowel occurs at the proper moment, it is also
included in the ongoing production, and inserted within the planned CV programme.
An obvious next step is to analyse the precise temporal relationships between first
and second occurrences of speech sounds under DAF.
The speech disruptions of real stutterers are for a large part the result of compen
satory and substitutory articulations, which makes it difficult to analyse what part
might derive from auditory feedback. If auditory feedback plays a role at all in
stuttering, the time interval between programme execution and perception would be
much smaller than in DAF, so that other types of production errors should be ex
pected.
As to the precise role of auditory feedback, Fairbanks and Guttman (op. cit.), pointing
to the fact that DAF speech has mainly single repetitions, suggest that a single re
petition would temporarily restore the normal feedback conditions. Our limited data
seem to indicate however that the conditions for the occurrence of a repetition are
very restricted, and that these conditions as a rule have disappeared as soon as a
repetition has been produced, which makes a second repetition highly unlikely.
Summary
The analysis of production errors under DAF shows that the delayed auditory signal
may introduce extraneous commands to the articulators that are at that time in the
act of executing a planned speech programme. The associated errors are repetitions
of syllables or single vowels, where the repetition of syllables appears to occur
only if the delayed signal and the programmed syllable have coinciding similarities.
The more general disturbing effect of DAF - lengthening - seems to derive from coinciding dissimilarities between programme and auditory signal.
References
Bloodstein, O. (1975) A handbook on stuttering, revised edition, Nat. Easter Seal Soc. for Crippled Children and Adults.
Cherry, C. and Sayers, B.McA. (1956) Experiments upon the total inhibition of stammering by external control, and some clinical results, J. Psychosom. Res., 1, p. 233-246.
Fairbanks, G. and Guttman, N. (1958) Effects of delayed auditory feedback upon articulation, J.S.H.R., 1, p. 12-22.
Lane, H. and Tranel, B. (1971) The Lombard sign and the role of hearing in speech, J.S.H.R., 14, p. 677-709.
Morton, J. (1968) Considerations of grammar and computation in language behavior, Studies of language and language behavior, Ed. J.C. Catford, VI, p. 499-545.
Vowel length and the perception of prosodic boundaries
D. Bouwhuis and J. de Rooij
Introduction
In the context of a research project set up to investigate the relative contributions
of temporal structures and pitch contours to the perception of prosodic boundaries,
perceptual measurements, designed to establish the contribution of vowel length
have been carried out. In an ambiguous utterance with 3 potential prosodic bounda
ries, durations of the 3 relevant vowels have been varied, together and separately.
Results of the perceptual measurements will be described. The contribution of each individual vowel to the combined effect of the 3 vowels together will be assessed in a quantitative description.
Perceptual measurements
The ambiguous Dutch word string "Daan zei de baas is te laat" (English equivalent
"Dan said the boss is late") can, depending on the location of prosodic boundaries,
be perceived as meaning either that according to the boss Dan is late (reading I:
Daan, zei de baas, is te laat) or that according to Dan the boss is late (reading II:
Daan zei, de baas is te laat).
A spoken version of reading I was processed by means of a computer-controlled channel
vocoder; thus we were able to replace pitch fluctuations by a slowly declining pitch
and to vary the durations of the vowels in "Daan", "zei" and "baas", respectively.
The combined contribution of these 3 vowels was investigated in the following way.
Seven versions of the utterance were constructed; the 1st and 7th versions, with
respect to the durations of the 3 vowels had the durational organisation of spoken
versions of readings I and II, respectively. In the 1st version the 1st vowel was
252 ms, the 2nd 130 ms and the 3rd 223 ms; in the 7th version durations were res
pectively, 162 ms, 190 ms and 163 ms. The other versions had in-between values:
the 1st vowel decreased in steps of 15 ms, the 2nd increased in steps of 10 ms and
the 3rd decreased in steps of 10 ms from the 1st to the 7th version. All other seg
ment durations were equal in all versions.
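The seven duration patterns follow directly from the step sizes given above; written out in Python (values in ms):

# Vowel durations (ms) in "Daan", "zei" and "baas" for the 7 stimulus versions.
versions = [(252 - 15 * k, 130 + 10 * k, 223 - 10 * k) for k in range(7)]
for n, (daan, zei, baas) in enumerate(versions, start=1):
    print(f"version {n}: Daan {daan}, zei {zei}, baas {baas}")
# version 1 gives (252, 130, 223) and version 7 gives (162, 190, 163), as in the text.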
Stimulus sentences thus obtained, were presented in random order, 10 times each, to
10 listeners for the identification of the reading. From this identification the
perception of the concomitant prosodic boundaries has been inferred. Results of 3
different presentations are given in fig. 1. They show that changing the durations
of only the 3 vowels reliably alters prosodic boundary perception.
At the same time the relative contribution of each of the 3 vowels to prosodic
boundary perception was investigated. This was done by varying only one of the
vowel durations at a time and keeping the other two at the values of stimulus number
4. These values had been found (in an experiment not described here) to correspond
to the estimated 50% cross-over point between both sentence readings.
It may be seen that the 1st and 2nd vowels are equally effective in determining
boundary perception but less so than the combined 3 vowels. It can also be seen
that the 3rd vowel has no effect at all.
[Figs. 1 and 2: proportion of reading I responses (%) plotted against stimulus number (1-7); legend: 1st vowel, 2nd vowel, 3rd vowel.]
Fig. 1. Proportion of reading I responses as a function of vowel durations in the italicised words of the ambiguous word string "Daan zei de baas is te laat", in 3 different presentations.
Fig. 2. As in Fig. 1. However, now only one of the vowel durations has been varied at a time, the other 2 having the values of stimulus number 4.
A quantitative analysis
In this analysis we want to assess whether the 1st and 2nd vowels contribute indepen
dently of each other to the combined effect found in the 1st experiment. The whole
utterance might be conceived of as a Gestalt-like pattern in which the effect of all
relevant vowel durations together is superior, or even incomparable to the effect
of the separate vowels. At the other extreme the combined effects of single vowel
durations might fully account for the results obtained by the simultaneous duration
variations. In between the separate effects may possibly reinforce each other in a
kind of interaction which increases their effectivity. Of all three, the second
condition would seem attractive from a theoretical standpoint, since any type of
dependence must be specified. Furthermore, the paucity of the present data does not
allow a large number of assumptions to be made. Therefore the independent processing
model will be explored here, and it will actually be found that more complicated
models cannot be tested on the results.
There are then two main applicable models, the Choice model (Luce, 1959) and the dis
tance model following the reasoning of Thurstonian scaling (Bock and Jones, 1968).
The nature of each of these models will be briefly discussed and a data fit presented.
The presented analysis is the description of a first stage of data analysis. New results, based on another estimation procedure and becoming available only after the time of closing for press, are in line with the published ones and will be presented elsewhere.
The choice model
This model describes the perception of one reading as the result of choosing between
2 response strengths, 1 for each of the 2 possible readings of the presented ut
terance. Each presentation of the utterance in some physical form is supposed to
change the response strength of a given answer. In unambiguous utterances this
would lead to an increase of the corresponding response strength, compared to which
the strengths of all other possible responses would be negligible. For the present
utterance 2 responses are feasible, and what the choice model (and the distance model) describe is the probability with which one or the other has been chosen. Without auditory information the probability of reading I might be expressed as P(I) = 1/(1 + β). The response strength of reading I is taken here to be equal to 1, β being the response strength of reading II. Any auditory information relevant to one of the readings would tend to change these parameters. In the Choice model this occurs in a multiplicative way. Upon hearing the varied length of the vowel in /da:n/ the response strength for reading I would be δ_i, i indicating the particular length, δ being the product of the original response strength 1 and the effect of variation of the vowel length. In this condition the probability of choosing reading I would be

P(I | (Daan)_i) = δ_i / (δ_i + β).
For all 7 vowel variations a choice probability has been obtained, by which 7 of the response strengths could be estimated. This, however, precludes an estimation of the a priori response strength of reading II, the value of β. Looking at the variation data of the 3rd vowel /ba:s/ it seems that its length contributes little or nothing to the choice of one particular reading. The slight scatter could also be wholly due to stochastic fluctuations. For the present purpose the effect of vowel length in /ba:s/ will be assumed as fixed. The factor β can then be estimated from the average choice probability in situations where only the length of the vowel in 'baas' was varied, making it more reliable than the factors δ. The response strength factor β is thus found to be about 1.3, slightly higher than that of reading I. Obviously a slight criterion shift took place in the choice process with respect to the earlier experiment where 'neutral' vowel lengths had been chosen to yield a choice probability of 50%.
Hence, by obtaining the value of β, all other values for the response strengths can be easily estimated from the corresponding choice probabilities. In this way 7 estimates of the response strength δ_i for 'Daan' are obtained, as well as the factors ζ_i denoting the response strengths for 'zei'.
The effects of the two vowel lengths, δ and ζ, must also combine multiplicatively. Thus for a particular combination of vowel lengths denoted by i, the probability of choosing reading I is

P(I | (Daan, zei, baas)_i) = δ_i ζ_i / (δ_i ζ_i + β),

realising that the effect of 'baas' is negligible. The effects of length variations of the first two vowels are completely contained in δ_i and ζ_i, the factor β remaining constant.
These predicted probabilities should then be the same as those found in the experiment where all three vowel lengths were varied. The fit of these predicted probabilities is shown in fig. 3 as the uninterrupted line. There are some minor non-systematic deviations but the overall fit is quite good. The proportion of explained variance is .986, which is about maximum considering the sampling variance of the present data. In fig. 3 are also shown the assumed probabilities for 'baas', which were fixed as a consequence of the estimation procedure. Taking these data into account too, the proportion of explained variance decreases to .969, which is still quite acceptable. This prediction therefore takes into account both probabilities for simultaneous vowel variations as well as those for variations of only /ba:s/. So parameters for these latter variations have been adapted, not really estimated.

[Fig. 3: proportion of reading I responses (%) as a function of stimulus number (1-7); observed data together with the predictions of the choice model and the distance model.]
Fig. 3. Observed and predicted reading I responses as a function of stimulus number. Circles refer to responses averaged over the 3 presentations of Fig. 1. Boxes refer to the 3rd vowel variation only of Fig. 2.

The distance model
Here the effect owing to vowel length is assumed to result in an internal representation strength which is a stochastic variable. It may take on several values for the same input at different times and its distribution is assumed to be normal. The listener has an internal criterion such that when the representation strength exceeds it the response reading I will follow, otherwise reading II. When the utterance containing given vowel lengths has been presented a number of times the criterion of the listener will sometimes have been exceeded and sometimes not, depending on the exact vowel lengths. The relative frequency of occurrences of reading I responses is an estimate of the probability that the internal representation strength had a value as high as or even higher than the criterion. Hence this probability may be seen as the area under the normal distribution curve to the left of the criterion, where the mean of the distribution corresponds to the average of the representation strength produced by the vowel lengths. This average reflects the relative strength on the response axis produced by the vowel lengths, each vowel length leading to a different value. In such a model the increases in representational strength are thought to be additive and consequently effects caused by two vowels must be added on the strength dimension.
For estimation of these strengths the effects of vowel length in 'baas' will again be taken to be fixed at .44, this value being denoted as b. The effect of vowel length in 'Daan' can then be designated as d_i - b, the corresponding effect of 'zei' by z_i - b. For any prediction of the effect of both vowels, the influence of 'baas' and the rest of the utterance is necessarily present, so that the prediction equation is:

s(I | (Daan, zei, baas)_i) = d_i - b + z_i - b + b

Here s is the predicted representation strength from which the response probability can be immediately deduced.
The predictions of this model are shown in fig. 3 along with those of the other model. It can be seen that in the middle part of the curve the predicted values of both models coincide. Only the predictions of the end points are somewhat too extreme compared with the experimental values. The proportion of explained variance is almost the same: .985; together with the assumed fixed values for the effect of 'baas' it amounts to .967.
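To make the two independence models concrete, the following sketch (a Python illustration of our own; the single-vowel proportions used here are invented, not the experimental data, and the mapping of proportions onto strengths is our reading of the estimation step) computes combined predictions for both models:

from statistics import NormalDist

norm = NormalDist()     # standard normal used by the distance model
beta = 1.3              # a priori response strength of reading II (choice model)
b = 0.44                # fixed representation strength of 'baas' (distance model)

# Invented proportions of reading I responses when only one vowel is varied.
p_daan = [0.85, 0.78, 0.68, 0.55, 0.42, 0.33, 0.25]
p_zei  = [0.83, 0.75, 0.66, 0.52, 0.40, 0.30, 0.22]

# Choice model: p = delta/(delta + beta), so delta = beta * p/(1 - p);
# combined prediction: P = delta*zeta / (delta*zeta + beta).
delta = [beta * p / (1 - p) for p in p_daan]
zeta  = [beta * p / (1 - p) for p in p_zei]
p_choice = [d * z / (d * z + beta) for d, z in zip(delta, zeta)]

# Distance model: p = Phi(strength); strengths combine additively:
# s = (d_i - b) + (z_i - b) + b.
d_i = [norm.inv_cdf(p) for p in p_daan]
z_i = [norm.inv_cdf(p) for p in p_zei]
p_dist = [norm.cdf((d - b) + (z - b) + b) for d, z in zip(d_i, z_i)]

for k, (pc, pd) in enumerate(zip(p_choice, p_dist), start=1):
    print(f"stimulus {k}: choice {pc:.3f}   distance {pd:.3f}")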
Discussion
It is not unexpected that both types of predictions are so closely related since the function of the choice model, often called the logistic, is quite similar to
the cumulative normal distribution and is sometimes used as an alternative. Except
for slight differences at the extremes the correspondence here also seems satis
fying. The good fit of both models suggests that the effects of different vowels
are independent of each other. Within the framework of the models the contribution to the final response of one vowel length is not conditional upon that of another vowel length. Such a phenomenon supports the view that the operations involved in utterance interpretation are basically simple.
On the other hand the estimation procedure employed here does not allow the existence
of other parameters possibly defining dependence between vowels. The number of ob
tained values completely covers the number of parameters for effective vowel length
except for those in 'baas'. The present data can therefore not offer any information
on higher-order models. However, this seems unnecessary in view of the good corres
pondence found in the application of the independence models. Finally, the distinction
between multiplicativity and additivity as employed in this context possibly deserves some additional comment.
We note that in the choice model the parameter δ may also be written as e^D, where e is the base of the natural logarithm and D represents the strength owing to the vowel length. Consequently the product δζ may be written as e^D · e^Z = e^(D+Z), where additivity applies in the exponent. In the cumulative normal distribution the nature of the d- and z-parameters is somewhat analogous, though an analytical expression for it does not exist.
The most important conclusion reached from application of both models is that, in speech, different units contribute independently to the interpretation of an utter
ance. In the present experimental data, independence can only be established by
means of a model since no direct measures are available.
The productive value of the independence is that the operations in speech perception
are simple in structure, which is attractive for further experimentation.
Summary
In a perceptual experiment, vowel length variations in an utterance could determine
prosodic boundary perception. Boundary perception was inferred from the interpre
tation of the utterance. There were 2 conditions: in one the length of only a single vowel was varied systematically; in the other the lengths of all 3 vowels concerned were varied simultaneously.
From a quantitative analysis it was found within the chosen framework, that the
separate vowel lengths contributed independently of each other to the perception of
boundaries in speech.
References
Bock, R.D. and Jones, L.V. (1968) The measurement and prediction of judgment and choice, Holden-Day, San Francisco.
Luce, R.D. (1959) Individual choice behavior, J. Wiley & Sons, London.
The perception of English intonation by Dutch and English listeners
R. Collier
Introduction
The present report concerns the initial part of a research program on the comparison
of Dutch and English intonation. Our intention is to subject English intonation to
the same perceptual analyses as have led to an advanced knowledge of Dutch intonation
('t Hart and Cohen, 1973; 't Hart and Collier, 1975).
The problem
In the study of the perceptual aspects of Dutch intonation we have, among other
things, encountered the following problem. From the acoustical point of view unlimited
variability is noted in the course of the fundamental frequency, but some of these
physical variations are not perceived in the perceptual analysis. In the acoustical
resynthesis of an utterance one can therefore smooth the course of the fundamental
without affecting the perceptual equivalence between the original utterance and its
highly stylised copy. However, the degree of perceptual equivalence between an
original and its copy is differently evaluated, depending on the mode of listening.
One can hear differences by listening "analytically", but when listening in a "broad"
fashion, as in a normal communicative situation, one's threshold becomes markedly
higher. In the latter situation various analytical differences between pitch contours
(whether caused by stylisation or not) are judged to be of secondary importance and
are considered to be possible variations in the realisation of the same intonational
category or "pattern".
Several experiments have explored the extent to which Dutch listeners can map dif
ferent pitch contours onto the intonation patterns of their language (Collier and
't Hart, 1972; 't Hart and Collier, 1975; Collier, 1975). In these experiments sub
jects were presented with up to 20 different pitch contours which they were asked to
group according to a criterion of melodic resemblance.
The same kind of experiment has been repeated with English-speaking subjects who, in
their turn, had to group 20 pitch contours of their native language. The same ex
periment was also run with Dutch-speaking subjects who were presented with utterances
in a foreign language that they were familiar with, but only to a limited extent.
The following hypotheses were formulated: (1) language users are capable of grouping
20 different pitch contours into a fairly limited number of intonational categories;
(2) the categorisation by the English speakers corresponds to the basic intonation
patterns as described in manuals on English intonation; (3) the categorisation by
the Dutch speakers differs from that made by the English speakers.
The experiment
Halliday (1970) has described in some detail the melodic characteristics of English.
He distinguishes 7 basic intonation patterns, called "Tones", and indicates for
each a number of variants in their realisation. His description is accompanied by a
set of tape recordings containing examples of the various Tones. From among these illustrations 20 utterances were selected which represent the 7 Tones in at least 2 and at most 5 variant realisations.
[Fig. 1: stylised drawings of the fundamental frequency courses of the 20 stimuli, numbered 1-20, grouped as Tone 1: 1-5; Tone 2: 6-9; Tone 3: 10, 11; Tone 1+3: 12, 13; Tone 4: 14-16; Tone 5: 17, 18; Tone 5+3: 19, 20.]
Fig. 1. The fundamental frequency of the 20 stimuli and their categorisation in "Tones" according to Halliday.
Fig. 1 presents the fundamental frequency of each utterance. The categorisation
proposed by Halliday is the following: utterances 1 to 5 = Tone 1, utterances 6 to 9 = Tone 2, utterances 10 and 11 = Tone 3, utterances 12 and 13 = Tone 1+3 (a combination of Tone 1 followed by Tone 3), utterances 14 to 16 = Tone 4, utterances
17 and 18 = Tone 5, and utterances 19 and 20 = Tone 5+3.
The 20 utterances were recorded on "Language Master" cards (Bell-Howell). In this
way the subject has immediate random access to the utterances, which he can compare
at will in any order. The first group of subjects consisted of 13 English speakers,
the second of 14 Dutch speakers. They were requested to sort out the utterances
according to the criterion of melodic resemblance. The number of groups into which
to divide the set of utterances was left to their own judgment.
Results
Counting the number of times each utterance has been grouped with each of the other
utterances, we obtain a score indicative of the degree of melodic similarity among
the individual pitch contours. Melodically very similar utterances form a coherent
pair or group. In such a group a member may also show a (weaker) relationship to a
member of another group. The groups are
thus not neatly separated but stand per
haps in a hierarchical relationship to
one another. This hierarchy can be com
puted on the basis of the scores of the
individual utterances, using an algo
rithm designed by Johnson (1967). Fig. 2
presents the results of the "maximum"
method of the hierarchical clustering
analysis. If two or more utterances join
at a high level in that figure, it means
that their melodic similarity has been
assessed as strong by the subjects.
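As a sketch of this clustering step (in Python; the co-grouping counts below are invented for illustration), Johnson's "maximum" method amounts to complete-linkage agglomeration on the similarity scores:

# Complete-linkage ("maximum" method) clustering on co-grouping counts.
# sim holds, for each pair of utterances, the number of subjects who grouped them together.
sim = {
    frozenset({1, 2}): 10, frozenset({1, 3}): 9, frozenset({2, 3}): 9,
    frozenset({1, 6}): 2,  frozenset({2, 6}): 1, frozenset({3, 6}): 1,
    frozenset({6, 7}): 11, frozenset({1, 7}): 2, frozenset({2, 7}): 1,
    frozenset({3, 7}): 2,
}
items = [1, 2, 3, 6, 7]

def cluster_sim(a, b):
    # Similarity between two clusters = minimum pairwise similarity of their members.
    return min(sim.get(frozenset({i, j}), 0) for i in a for j in b)

clusters = [frozenset({i}) for i in items]
while len(clusters) > 1:
    # Merge the pair of clusters whose (minimum) similarity is largest.
    a, b = max(((a, b) for a in clusters for b in clusters if a != b),
               key=lambda pair: cluster_sim(*pair))
    level = cluster_sim(a, b)
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    print(f"merge {sorted(a)} + {sorted(b)} at level {level}")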
Fig. 2. Results of the "maximum" method of the hierarchical cluster analysis, according to Johnson. A: English subjects, B: Dutch subjects.
English-speaking subjects (see Fig.2A)
One can see that the groupings between
levels 10 and 6 involve virtually only
pairs of utterances. The members of one
pair belong to the same Tone, with the
exception of (14, 19). Only representatives of Tones 1, 2 and 4 are grouped;
the variants of Tones 3, 5, 1+3 and 5+3
never constitute a cluster, not even at
a relatively low level. At level 5 all
the variants of Tone 1 are grouped:
(1, 2, 3, 4, 5). Tone 1 is the only one
whose variants are all grouped at some
level. At level 5 group (12, 15, 16) is
also formed, but this group is composed of representatives of both Tones 1+3 and 4.
It is striking that the variants of Tone 1 only cluster at level 5, whereas at level
6 group (13, 14, 19, 20) emerges which mixes Tones 1+3, 4 and 5+3.
An explanation why pairs like (6, 7) or (15, 16) are grouped at a high level can
not be based solely on the degree of physical resemblance as shown in Fig. 1. In
fact, that resemblance is no greater than, for instance, that between contours 12
and 13, which are not taken together. On the other hand, there is little physical
similarity between contour 12 and the pair of contours (15, 16) that do form a
group at level 5.
Dutch-speaking subjects (see Fig. 2B)
The formation of pairs by the Dutch subjects corresponds fairly well to that by
the English listeners, but the groupings are made at a higher level.
The pairs involve contours that belong to Tones 1, 2, 4 and 5. The grouping of
more than 2 contours also takes place at a higher level compared with the pre
ceding results. Here, too, the subjects feel that all the variants of Tone 1
should be grouped. Two other groups are distinguished, viz. (6, 7, 10) and
(8, 9, 12, 13). These do not completely correspond to Halliday's classification,
but from the physical point of view these groupings are more plausible than the
heterogeneous mixtures produced by the English listeners.
Some considerations
It is worthy of note that the 2 groups of subjects are in accord on 2 points: (1)
they put all the variants of Tone 1 in one category: (2) they divide the 4 variants
of Tone 2 into 2 distinct groups, viz. (6, 7) and (8, 9). In view of the physical
differences between the 2 types of variants of Tone 2, this subdivision seems
justified and casts some doubt on Halliday's classification. The performance of
the Dutch subjects further suggests, that the distinction Halliday makes between
the variants of Tone 1+3 (contours 12, 13) on the one hand and some of the re
presentatives of Tone 2 (contours 8, 9) on the other hand, is open to doubt.
The classification of the Tones proposed by Halliday pretends to be "melodic".
Nevertheless the impression is that Halliday has allowed "functional" criteria to
come into play. In fact, the utterances that illustrate the use of Tone 2 are all
questions, while those that are examples of Tone 1+3 are all assertions. This
might explain why Halliday - perhaps unconsciously - makes a distinction between
(8, 9) and (12, 13), in spite of their melodic resemblance, whereas he groups
(6, 7, 8, 9) together, even though they represent 2 melodically different sub
groups.
The Dutch-speaking subjects make a classification that is more plausible from
the acoustical point of view. Confronted with a foreign language they concentrate
more easily on the purely melodic aspects of the utterances, while the English
subjects (and even Halliday to a certain degree) are diverted by the attention
they pay to the interpretative correspondences among the pitch contours.
Finally, in evaluating the subjects' performances we should bear in mind that in
this pilot experiment they were expected to distinguish not less than 7 hypothe
tically different intonation patterns. This may have been too difficult a task.
In previous experiments on Dutch intonation, not more than 3 hypothetical into
nation categories were included in the stimulus materials. This led to fewer
complaints and neater results.
Summary
Both English and Dutch-speaking subjects appear to be able to divide 20 different
English pitch contours into a smaller number of intonational categories. Their
performance differs to a certain extent as a function of their different linguistic
backgrounds. The categorisation proposed by both groups of subjects differs in
important respects from the classification of Halliday.
References
Collier, R. (1975) Perceptual and linguistic tolerance in intonation, IRAL, 13, p. 293-308.
Collier, R. and 't Hart, J. (1972) Perceptual experiments on Dutch intonation, Proc. 7th Int. Congr. Phon. Sci., Mouton, Den Haag, p. 880-884.
Halliday, M.A.K. (1970) A course in spoken English: Intonation, Oxford University Press, London.
't Hart, J. and Cohen, A. (1973) Intonation by rule: a perceptual quest, Journ. of Phon., 1, p. 309-327.
't Hart, J. and Collier, R. (1975) Integrating different levels of intonation analysis, Journ. of Phon., 3, p. 235-255.
Johnson, S.C. (1967) Hierarchical clustering schemes, Psychometrika, 32, p. 241-254.
The IPO speech squeezing system
S.M. Marcus
Introduction
A previous report has described a program for visual and auditory analysis of digitised
speech (Marcus, 1976). Some extensions to this program, OVID, will be described here, together with its place within a flexible system for high-quality selective expansion and compression of natural speech. Some research applications will be outlined, together
with a new observation on the effect of segmental duration on pragmatic aspects of per
ceived voice quality.
OVID
OVID (Output and Visual Display) is a program for visual and auditory inspection of a
digitised speech file. It has been described in last year's annual progress report
(Marcus, 1976), since when two important extensions have been made.
First, in addition to the main display window, of from 10 to 200 msec of the digitised
amplitude-time waveform, a second upper display has been added showing one second
of the waveform and the position of the lower window within it. Because of the limited
speed of the display hardware, the upper display shows only every twentieth sample
from the 10,000 in the 1-second section of waveform (sampling is at 10 kHz). Despite
the poor resolution (due to aliasing beats between this effective 500 Hz sampling
frequency and the 5 kHz bandwidth speech signal) the upper display greatly facilitates
program operation. Figures 1, 2 and 3 illustrate typical OVID displays. The displayed
digits indicate the number of samples from the beginning of the window to the vertical
bar cursor, and the amplitude of the indicated sample. Thus the single period of vocal
cord vibration marked by the cursor in Fig. 1 is 80 samples long; at the sampling fre
quency of 10 kHz this is 8 msec and corresponds to a frequency of 125 Hz.
A second facility essential to the implementation of the system which will be des
cribed in the next section is the ability to assign labels to points in the wave
form. These labels are any two characters from the teletype keyboard. Their nature
and location can be assigned arbitrarily, but they are used here to give a crude
phonemic coding to the digitised waveform. Such events as vowel onsets are not al
ways easy to determine, but the visual and auditory inspection facilities of the
program are a powerful aid to the ear of the user.
'Speech squeezing'
A number of systems at the institute have used channel vocoders for speech pro
cessing. One of these is the "Ritmator", which allows changes in segment duration
to be made to speech (Willems and de Jong, 1974). Unfortunately, the quality of
speech processed in this way can be described as intelligible rather than natural.
Although more recent work with LPC-vocoders (Vogten, this issue) has resulted in
enormous improvement, it was felt desirable for experiments on speech timing to keep
as close as possible to the original speech quality, and therefore to modify the
original (digitised) amplitude-time waveform as little as possible.
[Figs. 1-3: OVID amplitude-time displays at window lengths ranging from 1 sec down to 12 msec; the cursor read-out shows x = 80 samples, y = 102.]
Fig. 1, 2 and 3. Examples of OVID displays illustrating the range of time scales available.
Given the location of comparable points in each pitch period, speech may be compressed
or expanded by deletion or duplication of whole periods. Providing the start and endpoints are carefully chosen to minimise discontinuities, the resulting speech is comparable in quality to the original digitised natural speech. Fig. 4 illustrates how
this process might be used to compress speech by a factor of 2/3 by omitting one pitch period in three.
[Fig. 4: the original amplitude-time waveform (top) and the compressed waveform (x 2/3, bottom); scale bar 10 msec.]
Fig. 4. Pitch synchronous compression of a section of speech by a factor of two thirds.
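A minimal sketch of this pitch-synchronous editing, in Python (assuming the period boundaries have already been marked with OVID; the waveform and boundaries below are artificial):

import math

def compress_two_thirds(samples, boundaries):
    # 'boundaries' are sample indices of pitch-period onsets; the last entry
    # closes the final period.  Every third period is omitted.
    periods = [samples[boundaries[i]:boundaries[i + 1]]
               for i in range(len(boundaries) - 1)]
    kept = [p for k, p in enumerate(periods) if k % 3 != 2]
    return [s for p in kept for s in p]

# Six artificial 80-sample periods (8 msec each at a 10 kHz sampling rate).
period = [int(1000 * math.sin(2 * math.pi * n / 80)) for n in range(80)]
samples = period * 6
boundaries = [80 * i for i in range(7)]
print(len(samples), len(compress_two_thirds(samples, boundaries)))   # 480 -> 320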
The output of OVID gives the location and duration of pitch periods, and of similar
duration segments of unvoiced speech and silent intervals which can be compressed in
the same way. The user also has to indicate segments which may under no circumstances
be duplicated or deleted, such as stop bursts or periods involving large amplitude
changes. Segment classification is performed manually by the user, the different seg
ment types being indicated by means of a set of pushbuttons (see Marcus, 1976), and
takes something in the order of 30-60 times real time, depending on the pitch of the
speaker's voice.
Since only whole segments may be duplicated or deleted, the problem of changing the
duration of a section of speech by a desired amount requires an optimum choice out of
the possible segments.
The requirement of attaining a desired duration may conflict with the desirability of
distributing changes as evenly as possible throughout the original waveform. The so
lution adopted is to use a random number generator to select segments and then impose
the requirement that, if that segment is changed, the new section duration must be
closer to its desired value (absolute) than before the change. This continues until
a criterion of proximity to the target duration is reached.
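The selection procedure itself can be sketched as follows (a simplified Python version that only deletes segments; duplication for lengthening works analogously, and all names and numbers here are illustrative):

import random

def squeeze_to_target(durations, deletable, target, tolerance=5.0, rng=random):
    # Randomly pick deletable segments; accept a deletion only if it brings the
    # total section duration closer to the target, until within the tolerance.
    durations = list(durations)
    total = sum(durations)
    while abs(total - target) > tolerance:
        candidates = [i for i in deletable if durations[i] > 0]
        if not candidates:
            break
        i = rng.choice(candidates)
        new_total = total - durations[i]
        if abs(new_total - target) < abs(total - target):
            total = new_total
            durations[i] = 0            # segment deleted
    return durations, total

# Ten 8-ms pitch periods; the first two (e.g. a stop burst) must not be touched.
durs = [8.0] * 10
print(squeeze_to_target(durs, deletable=range(2, 10), target=56.0))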
In order to make selective changes in segment duration, the labelling facility im
plemented in OVID is used to indicate those sections which should be modified. An im
portant consideration was that in many cases a number of different changes might be
attempted on the same sets of segments, which might not lie in a sequence. A simple
experimenter-oriented control language was therefore devised which first allows a
"structure" to be assigned to an utterance, and tlilen permits changes to be made to
elements of this structure. Fig. 5 illustrates an interaction with the program.
First the name of an OVID data file containing both segmental (periodic) and label
information is given. The program indicates that this is information over speech file
SQt001 and then types the labels, the two characters one above another. The user
must then type one teletype character under each label. Each teletype character
defines a group on which operations will be carried out separately. In the example,
all the consonant-like sections have been assigned to group C and all the vowel-like
to group V. Possible operations are to change each section by a specified ratio, by
a specified duration, or to a fixed duration (which may be zero). When all operations
have been given, the command GO results in a new speech file being constructed and
output. With the current hardware this takes about 3 times real time. An optional
print-out gives each label, the group to which it has been assigned, its onset time
in the new version (in seconds), its original duration, its new duration and the
error between this obtained value and that required by the operation on sections in
that group (see Fig. 5.).
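By way of illustration, the effect of such group operations on labelled segment durations can be sketched as follows (in Python; the durations are taken loosely from the Fig. 5 print-out, and the real program works on whole pitch periods, so its new durations differ slightly):

# Apply per-group operations (as in the control language above) to labelled segments.
segments = [("H", "C", 0.034), ("EE", "V", 0.059), ("L", "C", 0.056), ("AA", "V", 0.231)]
operations = {"C": lambda d: d * 3 / 4,     # consonant-like sections to 3/4
              "V": lambda d: d * 4 / 3}     # vowel-like sections extended by one third

onset = 0.0
for label, group, old in segments:
    new = operations.get(group, lambda d: d)(old)
    print(f"{label:3s} {group}  onset {onset:.3f}  old {old:.3f}  new {new:.3f}")
    onset += new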
Applications
The system provides a powerful tool for experiments with virtually optimum quality
changes in speech rate. Unlike most systems which involve direct modification of the
digitised speech waveform, the two-stage process of analysis and "squeezing" coupled
with the group structure of the experimenter-oriented control language gives great
flexibili ty.
A simple application is the simulation of hardware speech compressors of varying de
grees of sophistication - those involving deletion of fixed duration segments, seg
ments lying between two zero crossings and true-pitch synchronous compressors - and
assessing the trade-off between the cost of such sophistication and intelligibility
and quality.
Monitoring tasks are providing valuable insight into the contribution of many acoustic,
phonetic, linguistic and extralinguistic factors in the real-time comprehension of
speech. Tasks involve measuring the reaction time of subjects in responding to a pre
specified target. Target types which have been used include phonemes (Foss, 1969),
syllables, words and phrases (McNeill & Lindig, 1973), words, rhymes and semantic
categories (Marslen-Wilson and Tyler, 1975) and mispronunciations of single pho-
nemes in words (Cole, 1973).
It is now clear that in the understanding of continuous semantically organised speech,
even monitoring of the first phoneme in a word involves word recognition (Morton and
Long, 1976). We can therefore turn with interest to experiments showing the effects
of rhythmic structure, that is relative timing of sentence stress patterning, on pho
neme monitoring times to real words (Cutler, 1976; Cutler and Foss, 1977) and possi
bly also to similar results for nonsense words (Shields, McHugh and Martin, 1974),
and ask further what effect the actual temporal structure, rather than relative
stress patterning, has on speech perception. It is clear that the system described
above provides the ideal tool for producing the required manipulations in temporal
structure. Furthermore, digital speech processing allows the location of synchroni
sation pulses used for reaction time measurement to be precisely determined; the
precise identity of stimuli, differing only by the experimental manipulation, reduces
the danger of artefactual results through differences in two human versions of the
"same" sentence.
[Fig. 5: transcript of a program interaction. After the command DATA, the user names an OVID data file; the program types the labels of the speech file, the user assigns each labelled section to group C or V, gives ratio operations for the two groups followed by GO, and an optional print-out lists for every label its group, new onset time, old and new durations, and the residual error; the session ends with the name of the new speech file and QUIT.]
Fig. 5. Example of program interaction in modifying the phrase "He laughed and laughed 'til his belly wiggled like jelly." Here consonants are shortened to 3/4 of their length and vowels extended by one third. The silence before "belly" is left unchanged in duration.
Finally, an interesting observation has arisen from selective changes in consonant
and vowel duration (Marcus, 1977). Although "uniform" compression of speech produced
by the above system does not produce vowel reduction (Lindblom, 1963), relative
changes in segment duration (Karlsson and Nord, 1972), nor take account of invariances
in segment duration important in perception (Marcus, in press), it appears that
large variations in linear speech rate can be made with little perceived change in
naturalness or speaker quality. This contrasts sharply with the results of relative
changes in duration of consonant-like and vowel-like segments. If the relative du
ration of "vowels" to "consonants" is increased by 50%, the overall effect is that
of a very lazy, unmotivated speaker. However, if the reverse is performed the effect
is of a stiff, tense, authoritative voice. This occurs even though overall speech
rate, relative timing, amplitude and intonation are the same for the two utterances.
Thus a change in the pragmatic interpretation of an utterance has arisen from fine
grain segmental timing, a result problematic for models in which prosodic infor
mation is seen as extracted from global cues, segmental analysis being only a final
stage preceding segment classification.
Summary
A system based on semi-automatic pitch-synchronous editing of natural speech is
described. It makes experimental manipulations of either local or global speech
rate easy to produce, and the quality of the manipulated speech is of a similar
standard to the original.
References
Cole, R.A. (1973) Listening for mispronunciations: a measure of what we hear during speech, Perception and Psychophysics, 11, p. 153-156.
Cutler, A. (1976) Phoneme monitoring reaction time as a function of preceding intonation contour, Perception and Psychophysics, 20, p. 55-60.
Cutler, A. and Foss, D.J. (1977) On the role of sentence stress in sentence processing, Language and Speech, 20, p. 1-10.
Foss, D.J. (1969) Decision processes during sentence comprehension: effects of lexical item difficulty upon decision times, Journal of Verbal Learning and Verbal Behaviour, 8, p. 457-462.
Karlsson, I. and Nord, L. (1972) Stops and CV segment duration, International Conference of Speech Communication and Processing, Bedford, Mass., Paper F5, p. 210-213, New York, IEEE.
Lindblom, B. (1963) Spectrographic study of vowel reduction, Journal of the Acoustical Society of America, 35, p. 1773-1781.
McNeill, D. and Lindig, K. (1973) The perceptual reality of phonemes, syllables, words and sentences, Journal of Verbal Learning and Verbal Behaviour, 11, p. 419-430.
Marcus, S.M. (1976) OVID - a further tool for speech perception studies, I.P.O. Annual Progress Report, 11, p. 31-33.
Marcus, S.M. (1977) The IPO speech squeezing system, Presented to Sussex meeting of the Institute of Acoustics Speech Group, July 1976.
Marcus, S.M. Distinguishing "slit" and "split" - an invariant timing cue in speech perception, Perception and Psychophysics, in press.
Marslen-Wilson, W.D. and Tyler, L.K. (1975) Processing structure of sentence perception, Nature, 257, p. 784-786.
Morton, J. and Long, J. (1976) Effect of word transitional probability on phoneme identification, Journal of Verbal Learning and Verbal Behaviour, 12, p. 43-51.
Shields, J.L., McHugh, A. and Martin, J.G. (1974) Reaction time to phoneme targets as a function of rhythmic cues in continuous speech, Journal of Experimental Psychology, 102, p. 250-255.
Willems, L.F. and De Jong, Th.A. (1974) Research tools for speech perception studies, I.P.O. Annual Progress Report, 9, p. 77-81.
Visual perception
Spatial processing of small visual stimuli
F.J.J. Blommaert
Introduction
During the past few years much experience has been gained at this laboratory in using
a perturbation technique for gathering information on dynamic properties of the
human visual system (Roufs and Blommaert, 1975; Roufs and Pellegrino, 1976).
In the field of spatial processing of details, Kulikowski and King-Smith (1973)
and Hines (1976) used subliminal summation successfully for measuring line- and
edge spread functions of the retina. They found that the visual system operates
linearly with respect to processing on threshold level of one distinct feature
like a line or an edge. Moreover, line-spread functions measured at eccentricities
of 1.25° and 2.50° indicated that the visual system, on the whole, operates
inhomogeneously. Therefore lines, with their spatial extensiveness, do not seem very
suitable as a probe stimulus in experiments where local properties of spatial pro-
cessing are being investigated. We propose that for this purpose a small point-shaped
stimulus is the obvious means.
In this paper a first experiment is described in which we tried to measure a point
spread function of the visual system. Results are reported of a second experiment
investigating the linearity of spatial processing in a small area of the retina.
Theory
The perturbation technique is based mainly on determining changes in the threshold
value of a point-shaped stimulus due to perturbation of its response caused by the
response of a faint subthreshold stimulus with properly chosen shape.
We take it that detection of quasi-static stimuli can be formalised by using a peak
detection model, i.e. a stimulus is seen if the extreme value of its response
U(x,y) exceeds a certain level D. The threshold condition may then be written as:

extr {U(x,y)} = D
Furthermore, we assume that within a small area of the fovea:
- the retina is homogeneous and circularly symmetrical;
- the processing is quasi-linear;
- the extreme value of the response of an infinitesimally small point source coincides with the coordinates of stimulation.
The response to an arbitrary stimulus may now be written as a convolution integral
with a local point spread function U_0(x,y):

U(x,y) = \iint U_0(x-x',\, y-y')\,\epsilon(x',y')\,dx'\,dy'

Here, \epsilon(x,y) is the distribution of retinal illumination of the stimulus and U(x,y)
is the response of the visual system.
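Purely as a numerical illustration of the convolution above (Python, with a made-up Gaussian point spread function and a small uniform stimulus; neither function is taken from the present measurements), the response at the origin can be approximated on a grid:

    # Numerical evaluation of U(0,0) = sum of U0(-x',-y') * eps(x',y') * dx * dy.
    # U0 (Gaussian) and eps (small uniform disc) are arbitrary stand-ins.
    import numpy as np

    step = 0.01                                  # grid spacing (min of arc)
    x = np.arange(-2.0, 2.0 + step, step)
    xx, yy = np.meshgrid(x, x)
    rr = np.hypot(xx, yy)

    U0 = np.exp(-rr**2)                          # made-up point spread function
    eps = np.where(rr <= 0.2, 1.0, 0.0)          # uniform stimulus of radius 0.2'

    U_origin = np.sum(U0 * eps) * step**2        # U0 is symmetric, so U0(-x,-y) = U0(x,y)
    A_p = np.sum(eps) * step**2                  # area of the stimulus
    approx = A_p * U0[len(x) // 2, len(x) // 2]  # E_p * A_p * U0(0) with E_p = 1
    print(U_origin, approx)

For a stimulus that is small compared with the width of the point spread function the two printed numbers are close, which is the small-stimulus approximation used in eq. 1a below.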
As in this paper we deal only with circularly symmetrical stimuli, it is convenient
to use polar coordinates:

U(r,\phi) = \int_0^{2\pi}\!\int_0^{\infty} r'\,U_0(|\vec{r}-\vec{r}\,'|)\,\epsilon(r',\phi')\,dr'\,d\phi'    (1)

For a small stimulus approximating a point source, with retinal illuminance E_p and
radius r_0, the response pattern becomes

U_p(r,\phi) = E_p \int_0^{2\pi}\!\int_0^{r_0} r'\,U_0(|\vec{r}-\vec{r}\,'|)\,dr'\,d\phi'

Due to the basic assumptions for determining threshold, only the response for r = 0
is of interest, and so

U_p(0) = E_p \int_0^{2\pi}\!\int_0^{r_0} r'\,U_0(r')\,dr'\,d\phi' \approx E_p A_p U_0(0)    (1a)

if r_0 is sufficiently small. Here, A_p is the area of the point.
The threshold condition of a point may now be formulated as

E_p A_p U_0(0) = D    (2)
If perturbation of a point response is applied with an arbitrary shape, eq. 2
changes into

E_{p,pert}\, A_p U_0(0) + U_{pert}(0) = D    (3)

Here U_{pert}(0) is the response in the origin of the perturbating stimulus, and E_{p,pert}
is the retinal illuminance of the point source necessary for detection of the combination.
Of course, eq. 3 is valid only if the retinal illuminance of the perturbation is so
small that detection still takes place by the extremum of the point response.
For the actual experiments we chose perturbation with two different shapes, as shown
in Fig. 1.
In the first experiment we used an annulus with radius r_a and width \Delta r_a. It can easily
be verified from eq. 1a that for \Delta r_a \ll r_a its response in the origin can be approximated
by

U_a(0) \approx E_a A_a U_0(r_a)

where A_a is the area of the annulus. If E_a/E_{p,a} = q, where E_{p,a} is the threshold of the
combination point-annulus, we can write for the threshold condition of the combination

E_{p,a}\,\{A_p U_0(0) + q A_a U_0(r_a)\} = D
[Fig. 1: schematic stimulus profiles (retinal illuminance as a function of position x).]
Fig. 1. Schematic representation of the stimulus configurations used. At the left, a
point-shaped stimulus with perturbating annulus of radius r_a and width \Delta r_a, representing
the collection of all points at distance r_a from the stimulus. At the right, the
point-shaped probe stimulus, together with a disc of radius r_d (a kind of negative-
going edge at distance r_d from the stimulus).
Comparing the thresholds with and without perturbation, E_{p,a} and E_p, we are able to
derive for the normalised point spread function U_0^*(r_a):

U_0^*(r_a) = \frac{A_p}{q A_a}\left\{\frac{E_p}{E_{p,a}} - 1\right\}    (4)

We can again find the absolute response by multiplying U_0^* by its absolute value
U_0(0), expressed in "D"-units, which can be computed from eq. (2). So, by measuring
the thresholds E_p and E_{p,a} we can calculate one discrete value of the point spread
function. By varying the radius r_a of the subthreshold annulus it is possible to find
a number of points for U_0^*(r_a).
The second perturbation shape, a disc with radius r_d, was chosen in order to check
the basic linearity and peak detection assumptions.
It can be seen from eq. 1 that the response pattern of a uniform disc at r = 0 can
be written as (subscript "d" for disc):

U_d(0) = E_d\, 2\pi \int_0^{r_d} r'\,U_0(r')\,dr'

If E_d/E_{p,d} = q, this leads to the threshold condition

E_{p,d}\left\{A_p U_0(0) + q\, 2\pi \int_0^{r_d} r'\,U_0(r')\,dr'\right\} = D

From the threshold change E_p/E_{p,d} it can be derived that

F^*(r_d) \equiv 2\pi \int_0^{r_d} r'\,U_0^*(r')\,dr' = \frac{A_p}{q}\left\{\frac{E_p}{E_{p,d}} - 1\right\}    (5)

By using discs with different values of r_d, we will find a discrete number of samples
for F^*(r_d), which has a definite relationship with the normalised point spread function
U_0^*(r) as expressed in eq. 5.
The experimental results may be interpreted as the central response of the retina to
a disc with increasing radius or almost as the central response of the retina to a
negative-going edge with increasing distance from the origin.
The unique relation between F^*(r_d) and U_0^*(r) expressed in eq. 5 enables us to check
the above-mentioned basic assumptions, such as quasi-linearity and peak detection, in combination.
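Purely by way of illustration, a minimal sketch (Python) of the arithmetic of eqs. 4 and 5; all threshold values and stimulus dimensions below are hypothetical numbers, not data from the experiments reported here:

    # One sample of the normalised point spread function U0*(r_a) from eq. 4,
    # and one of F*(r_d) from eq. 5, computed from (hypothetical) thresholds.
    import math

    def u0_star(E_p, E_pa, A_p, A_a, q):
        """Eq. 4: U0*(r_a) = (A_p / (q * A_a)) * (E_p / E_pa - 1)."""
        return (A_p / (q * A_a)) * (E_p / E_pa - 1.0)

    def f_star(E_p, E_pd, A_p, q):
        """Eq. 5: F*(r_d) = (A_p / q) * (E_p / E_pd - 1)."""
        return (A_p / q) * (E_p / E_pd - 1.0)

    # Hypothetical stimulus geometry (minutes of arc)
    r0, r_a, dr_a = 0.5, 2.0, 0.2            # point radius, annulus radius and width
    A_p = math.pi * r0**2                     # area of the point probe
    A_a = 2.0 * math.pi * r_a * dr_a          # area of the annulus

    # Hypothetical thresholds (arbitrary units) and subthreshold illuminance ratio q
    E_p, E_pa, E_pd, q = 1.00, 0.85, 0.70, 0.1

    print("U0*(r_a) =", u0_star(E_p, E_pa, A_p, A_a, q))
    print("F*(r_d)  =", f_star(E_p, E_pd, A_p, q))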
Methods
The subject viewed an 11° uniform field with a retinal illuminance of E = 1200 troland
monocularly. Both the stimulus and the perturbation shape were projected on top of this
field by using prisms. To facilitate fixation, four fixation lights were projected
around the stimulus in a circle with a radius of 1°. A 2 mm artificial pupil with an
entoptic guiding system was used. The lights were generated by linearised glow
modulators. The time functions used as an approximation of quasi-static presentation
of the stimuli consisted of blocks of about 300 ms the beginning and the end of
which were smoothed to avoid transient phenomena.
The subject had one knob to release the stimulus, which was delayed by a convenient
preset time interval. Three knobs enabled him to answer with "yes", "no" or "reject".
All thresholds were 50% probability values obtained by a modified "method of constant stimuli".
Results
Fig. 2a shows the normalised point spread function for subject FB. The absolute res
ponse, expressed in "D"-units, can be found by multiplying the reduced values by the
extreme value given in the legend under the code "norm. constant". This constant is
obtained by averaging the threshold E_p of the point alone over all 18 sessions of both
experiments. In this way we try to acquire an optimal representation of the sensi-
tivity during the whole experiment (a fuller explanation of the statistical procedure
followed will be given elsewhere). Per session, one point of the curve was measured,
each point being calculated from the average of 8 pair quotients according to eq. 4.
The experimentally determined standard deviation of the mean is indicated. The order
of measurement with respect to the r-axis was randomly chosen.
In Fig. 2b experimental data of F*(rd) according to eq. 5 are shown. They were
measured during 11 sessions, making an average of one point per session. All other
conditions were the same as for the point-spread function experiment. The dashed
curves are results of a simultaneous fitting to all experimental data in such a way
that the relation of the two functions exactly obeys eq. 5. The curves were obtained
by averaging the Fourier components of Fig. 2a and those of dF*/dr_d in Fig. 2b (the
differentiation was carried out in the frequency domain), followed by backward trans-
formation to U_0*(r) and F*(r), respectively.
[Fig. 2a,b: data plots for subject FB, E = 1200 td, norm. const. = 6.17·10⁻³ td⁻¹·min⁻²; horizontal axes r_a and r_d in minutes of arc.]
Fig. 2a and b. Experimental data of a) the normalised point spread function U_0*(r_a). They
may be interpreted as the response of the visual system to a point of unit energy.
b) F*(r_d) of eq. 5. It can be interpreted as the response to a disc at its centre,
with radius r_d of the disc as parameter. The dashed curves are the result of a
simultaneous computer fit. They exactly obey the relation between F*(r_d) and U_0*(r_a)
as dictated by eq. 5.
Discussion
It is clearly seen from Fig. 2 that the responses measured are fairly large in com
parison with the spread. The standard deviation, however, varies considerably between
the different data points as a consequence of changes in q- and A-values for the
various perturbations. From eqs. 4 and 5 it can easily be verified that different
q- and A-values lead to different threshold-spread effects on the experimental results.
Within measuring precision, all experimental points fall close to the dashed curves,
which supports the basic linearity and peak detection postulates. Other provisional
assumptions, such as circular symmetry, were not tested within the experimental design.
The obtained point spread function is found to be in accord with the results of Hines,
Kulikowski and King-Smith on the shape of Uo*(r), local linearity and peak detection.
The extensiveness of our point spread function (or line spread function if computed
from it) is somewhat less (zero-crossing at 2 min. of arc, whereas for measured line
spread functions this is about 3 min). A possible explanation of this difference may
be the circumstance that the line spread functions were obtained at retinal illumi
nances of the background that were much less than 1200 td.
It may also be due to differences in choice of the probe stimulus, i.e. a point
versus a line. Where a point is used as probe stimulus, the influence of retinal
inhomogeneities is reduced to a minimum.
Summary
A perturbation technique was used for measuring a point spread function of the visual
system. In combination with an experiment in which the central response to discs with
increasing radius was determined, basic assumptions such as quasi-linearity and peak
detection were tested and confirmed. The extent of the validity of other provisional
assumptions, like circular symmetry and local homogeneity, is still to be investigated.
References
Hines, M. (1976) Line spread function variation near the fovea, Vision Res., ~,
p. 567-572.
Kulikowski, J.J. and King-Smith, P.E. (1973) Spatial arrangement of line, edge andgrating detectors revealed by subthreshold summation, Vision Res., 2l, p. 1455-1478.
Roufs, J.A.J. and Blommaert, F.J.J. l1975) Pulse and step response of the visualsystem, I.P.O. Annual Progress Report, ~, p. 60-67.
Roufs, J.A.J. and Pellegrino, J.A. (1976) Gain curve of the eye to subliminal sinusoidal modulation of light, I.P.O. Annual Progress Report, ll, p. 56-63.
Visual recognition by dyslectic children: response latencies for letters and words
H. Bouma, Ch.P. Legein and A.L.M. van Rens
Introduction
We have, over a number of years, been carrying out investigations on deficiencies
underlying the poor reading of dyslectic children. The general aim of this research
has been to understand the poor reading in terms of a number of component processes
and mutual relationships between these processes. In particular we have studied
visual word recognition, since this is so obviously poor in dyslectics. One of our
early results was that dyslectic children and normal readers of the same age (control
group) recognize isolated letters with the same high degree of accuracy from brief
(100 msec) presentations. This indicated to us that the weak readers had an adequate
knowledge of letter forms. On the other hand, in the recognition of embedded letters,
dyslectic children scored much lower than control children. This we interpreted as
increased interferences between adjacent letters. Low recognition scores for words
also distinguished the dyslectic group from the control group and, from an analysis
of individual results, it seemed likely that the poor recognition of words was due
to poor recognition of embedded letters, rather than to inadequate word knowledge
(Bouma, Legein and van Rens, 1974, 1977).
Earlier, we had already shown that in normal adult readers, interferences between
adjacent letters limited word recognition, in particular in the parafoveal visual
field determining the horizontal span of vision or visual reading field (Bouma, 1973).
Also among children, both dyslectics and controls, we had observed a substantial de
crease of recognition scores for embedded letters and words (but not isolated letters),
in parafoveal presentation.
More recently, we have concerned ourselves with time factors in visual word recogni
tion. During fluent reading, eye movements, word recognition, activated memory con
tent and comprehension should have close time relationships. A disruption of rela
tive timings could be just as disastrous for ongoing reading as a deficiency in just
one of the separate factors, such as word recognition. In this spirit, we reported
that presentation of long words in two successive parts gave higher correct scores
than simultaneous presentation in spatially separated parts (Bouma, Legein and van
Rens, 1976). The importance of time factors was gathered from response latencies
in the visual recognition of letters and words, which we will report here.
Experiment
The experiment was concerned with the recognition of isolated letters, letters em
bedded between two letters x, and single well-known Dutch words. These stimuli were
presented one at a time for 100 msec either in foveal or in slightly parafoveal
vision (about 1° eccentricity or 4 letter positions from fixation). The subjects
responded by naming the letter or the word seen. An electronic counter was started
at the onset of the stimulus and stopped by a voice switch reacting to the initial
vocalization of the response. By listening to the tapes of the experiment, a number
of artefacts of the voice switch were eliminated. It is difficult to eliminate all
such artefacts (see also the contribution of Schroder to this issue), but we expect
latencies averaged over some 20 responses to be accurate to within 50 msec - slight
differences are less interesting anyway.
Subjects were 20 dyslectic children and 20 control children aged between 10 and 15
years. Both groups were as described in earlier reports to which we also refer for
experimental details (Bouma, Legein and van Rens, 1974, Bouma and Legein, 1977).
Foveal stimuli preceded parafoveal stimuli and for each stimulus position, isolated
letters, embedded letters and words were presented in that order.
Results and discussion
Average correct scores for the six conditions are presented in Table 1. Essentially
the results are similar to our earlier findings and will not be discussed here.
Average response latencies for correct responses are presented in Table Z. The
average differences between the dyslectic group and the control group have a clear
cut pattern: the dyslectic group is about 100 msec slower in correct letter responses
and 200 msec slower in correct word responses.
dyslectic control
Foveal isolated letters 97 96 %
Parafoveal isolated letters 94 96 %
Foveal embedded letters 74 91 %
Parafoveal embedded letters 41 56 %
Foveal words 88 98 %
Parafoveal words 58 77 %
Table 1. Correct response percentages as averaged over about 20 subjects.
dyslectic control
Foveal isolated letters 900 780 ms
Parafoveal isolated letters 840 740 ms
Foveal embedded letters 930 830 ms
Parafoveal embedded letters 1050 920 ms
Foveal words 890 680 ms
Parafoveal words 940 730 ms
Table 2. Response latencies of correct responses as averaged over about 20 subjects (N = 400-800).
For a possible interpretation, we schematically divide the time between stimulus
onset and response into three serial types of process: visual recognition (including
decision between alternatives), phonemic recoding, and the actual speaking (Fig. 1),
and collect arguments for the allocation of the observed time differences to the stages.
Working backwards, we know of no evidence that dyslectic children are slower speakers,
and in a situation where they had to repeat spoken words, even long and difficult
ones, we did not notice much difference between dyslectics and controls. This would
make it less likely that the delays are due to slower speaking. What about the pho
nemic recoding? On the assumption that words require a more complex recoding than
letters, the different delays between letters (100 msec) and words (200 msec) could
perhaps be sought in the phonemic recoding part. However, there are other relevant
observations. Incorrect responses (not shown in Table 2), which involve longer laten-
cies for both groups of subjects, show much greater differences between the groups
than correct responses. We think it unlikely that, once a stimulus has been incorrect-
ly recognized, it would take much longer to produce a phonemic recoding than it
would for correctly recognized stimuli. Thus, if the cause of incorrect recognition
lies in the visual recognition part, the extra delay cannot easily be allocated to
the phonemic recoding part. This leaves us then with the visual recognition part as
a likely candidate for the delays, and the differences between letters and words
should then also be primarily attributed to the recognition (and decision) part.
[Fig. 1: block diagram; boxes labelled visual recognition, phonemic recoding, vocal response and auditory recognition.]
Fig. 1. Block diagram of assumed serial stages in the oral reading of letters and words.
What can we say about the consequences of the observed delays for reading? Since we
know little about timings of component processes in reading, one cannot be definite
on this problem. There is some recent evidence on the time needed for triggering
eye movements in reading, which, in adult readers, indicates that word recognition
in reading takes no more than 200-300 msec (Rayner and McConkie, 1976). This corres-
ponds well to response latencies for visual recognition of single words by adults,
which can be as low as 300-400 msec. Even the normal reading children, however,
remained well above this value. What we can do then is to compare the extra delay
of dyslectic children with the normal recognition time estimates, with normal response
latencies, or with normal durations for eye fixations (150-500 msec).
In order to avoid basing this discussion on averages over subjects alone, Fig. 2
correlates individual latencies for foveal word recognition for the individual
subjects with reading level. The rather wide range of individual differences does
not hide the fact that, for a number of dyslectic children the delays are of the
same order of magnitude as the comparison durations for recognition and eye fixations
mentioned above. The conclusion can only be that such delays are incompatible with
normal reading, and that even with a full adaptation of all other component reading
processes to the delayed recognition, slow reading seems the maximum attainable.
On the other hand, some dyslectic readers have response latencies quite similar to
their normal reading peers and their weak reading should have a different background.
Referring to our earlier work in which we showed a number of functions subnormal in
dyslectics (Bouma, Legein and van Rens, 1975), we can only repeat our statement that
detailed insight into the component processes and their interaction in time is
badly needed.
[Fig. 2: scatter plot of reading level (grades 1-7) against response latency (0-1.5 sec) for foveal words; open circles: control children, filled circles: weak readers.]
Fig. 2. Reading level versus response latency for foveal words. Note the diversity of response latencies, particularly among weak readers.
Finally, let us take a brief glance at some other interesting comparisons in table 2.
Roughly, embedded letters take longer than isolated letters and words, and parafoveal
stimuli take longer than foveal stimuli. This cannot come as much of a surprise, but
it seems relevant to note that explicit recognition of certain embedded letters cannot
be taken to precede word recognition. Explicit recognition includes decision time
between likely candidates, which is unnecessary for component letters in a word recog
nition task (Bouma and Bouwhuis, 1975). Thus the possibility remains fully open that
word recognition proceeds through prior letter activations. It is puzzling that foveal
latencies for isolated letters are so high, in fact higher than parafoveal isolated
letters. Since the foveal isolated letters came first in the session, it could perhaps
be a training effect. However, such an effect was not evident from comparison of the
first and the last responses in the list.
Summary
Response latencies for visually presented letters and words have been measured for 20
weak readers (dyslectics) and 20 normal readers, aged 10-15 years. In their correct responses, many weak readers are consistently slower, the group averages differing by
100 msec for letters and 200 msec for simple words. The extra delay should probably be attributed to the recognition process itself rather than to phonemic recoding or
speaking processes. Since the extra delays are of the order of fixation pause durations, they can be taken as disruptive in reading.
Acknowledgement
Thanks are due to Messrs A. van Vroenhoven and J. Hupperetz, directors of the schools
involved, and to their staffs, for their kind cooperation.
References
Bouma, H. (1973) Visual interference in the parafoveal recognition of initial and final letters of words, Vision Research, 13, p. 767-782.
Bouma, H. and Bouwhuis, D.G. (1975) Word recognition and letter recognition, I.P.O. Annual Progress Report, 10, p. 53-59.
Bouma, H. and Legein, Ch.P. (1977) Foveal and parafoveal recognition of letters and words by dyslectics and by average readers, Neuropsychologia, 15, p. 69-80.
Bouma, H., Legein, Ch.P. and van Rens, A.L.M. (1974) Visual recognition by dyslectic children: a study on letter and word recognition in foveal and parafoveal vision in 20 weak readers and 20 normal readers, I.P.O. Annual Progress Report, 9, p. 104-109.
Bouma, H., Legein, Ch.P. and van Rens, A.L.M. (1975) Visual recognition by dyslectic children: further exploration of letter, word and number recognition in four weak and four normal readers, I.P.O. Annual Progress Report, 10, p. 72-78.
Bouma, H., Legein, Ch.P. and van Rens, A.L.M. (1976) Visual recognition by dyslectic children: spatial and temporal separation of word halves as recognition aids for dyslectic and control children, I.P.O. Annual Progress Report, 11, p. 64-68.
Rayner, K. and McConkie, G.W. (1976) What guides a reader's eye movement? Vision Research, 16, p. 829-837.
Backward masking in a reading-like situation
U.O. Schroder
Introduction
In reading, the eye jumps along a line of text, and each jump (20-60 msec) is
followed by a fixation pause of 100-500 msec. It is assumed that during this pause
the information is extracted from the text.
From visual research a phenomenon is known called backward masking, that is when
a given stimulus is followed after a short time by a second one, the first (test-)
stimulus is harder to detect, or may even go undetected. It is for this effect of
masking that we call the second stimulus "the mask". The effect of backward masking
depends on the kind of stimulus and mask, and the S.O.A. (Stimulus Onset Asynchrony).
It seems that in more complex stimuli/mask paradigms there must be a greater S.O.A.
to enable an "escape" of the test stimulus. For a two-, four- and six-letter string
followed by a pattern mask Zamansky et al. (1971) for instance found a threshold
S.O.A. of 35, 55 and 100 msec respectively, (computed from their Fig. 2, trailing
mask at 40 cd/m²). For a letter array with a circle mask, Averbach and Sperling
(1961) found a degraded performance for S.O.A.s of 100-200 msec.
We asked ourselves if, in a very complex situation like reading or scanning text,
the backward masking would extend to some 150 ms, degrading the incoming information
and thus limiting high speed reading or scanning of text. In this paper we report
on some experiments to establish the extent of backward masking.
Experiments
In the first experiment the stimulus and mask were Dutch three-letter words (fre-
quency of occurrence > 10⁻⁵); the stimulus and the mask were presented at over-
lapping places at one degree left or right of a fixation point. In a second experi-
ment the mask was always the same 32-letter-long sentence consisting of
unpronounceable (nonsense) words.
From both experiments correct scores and response latencies were obtained. Fitting
the response latency data to a normal distribution on both a linear and logarithmic
time scale, the logarithmic time scale gave the best fit. The mean response latency
and corresponding standard deviation of each subject is therefore computed on a lo-
garithmic time base and not, as usual, on a linear one.
It was observed further that the standard deviation was proportional to the mean
response latency; this observation fits well with a logarithmic transform of the
time scale.
The assumed underlying distribution of the response latencies is used to filter
the data to some extent. After a logarithmic transformation, mean and standard
deviation (s) are calculated, then data outside the interval mean ± 1.7 s are
deleted and mean and standard deviation are computed again. We are following this
procedure to exclude response latencies unrelated to what we wish to investigate.
This cleaning up in both theory and practice corresponds to deleting 5% of the data,
a method which was developed with the aid of computer plots of the data.
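By way of illustration, a minimal sketch (Python) of this cleaning procedure, assuming the two-sided ±1.7 s criterion described above; the latency values are hypothetical:

    # Log-transform the latencies, compute mean and standard deviation, discard
    # values outside mean +/- 1.7 s, and recompute both statistics.
    import math

    def trimmed_log_stats(latencies_ms, criterion=1.7):
        logs = [math.log(t) for t in latencies_ms]
        n = len(logs)
        mean = sum(logs) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in logs) / (n - 1))
        kept = [v for v in logs if abs(v - mean) <= criterion * sd]
        m2 = sum(kept) / len(kept)
        s2 = math.sqrt(sum((v - m2) ** 2 for v in kept) / (len(kept) - 1))
        return math.exp(m2), s2              # geometric mean latency (ms), log-domain s.d.

    latencies = [620, 650, 700, 710, 730, 760, 810, 1900]   # hypothetical responses
    print(trimmed_log_stats(latencies))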
Procedure
The stimuli were presented in blocks of 40. Table 1 shows the conditions
used in these blocks; within the blocks conditions varied in random order, the
only difference between the conditions with and without a mask being the presence
or absence of black letters in a masking field.
Condition   S.O.A.   Stimulus   Masking field
1           140 ms   20 ms      200 ms, without a mask
2           140 ms   20 ms      200 ms, with a mask
3           100 ms   20 ms      200 ms, without a mask
4           100 ms   20 ms      200 ms, with a mask

Table 1. The four conditions which were mixed in one experiment.
Results of word-mask experiment
In this experiment seven untrained subjects participated. The results are summarised
in table 2. From seven subjects we obtain seven mean response latencies and seven
standard deviations. Averaging the seven mean response latencies yields one
"averaged mean response latency" and a corresponding standard deviation; averaging
the seven standard deviations yields one "average standard deviation" and a cor
responding standard deviation.
The influence of a three-letter mask at the same place as an earlier presented three
letter stimulus is visible as a decrease in the percent-correct answers and an in
crease in response latency. The extent of this influence depends on the S.O.A.
(Fig. 1); for a S.O.A. of 100 ms the decrease of the percentage correct is 27%, the
increase in response latency of the correct answers being 16% (130 ms delay of the
correct answers).
[Fig. 1: decrease in percent correct (%C) and increase in response latency (RT), both in per cent, as a function of S.O.A. (10-1000 ms), for the word mask and the string mask.]
Fig. 1. The percentage correct score and the response latency as a function of the S.O.A. between stimulus and mask. In the figure, decreases in correct score (%C) and increases in response latency (RT) as compared to the nonmasked condition are shown.
Condition  S.O.A.   Mask  Average correct  Averaged mean response latency  Averaged individual standard deviation
                          score            of the correct responses        of the correct response latencies
1          140 ms   no    57% (14%)        719 ms (237 ms)                 19% (11%)
2          140 ms   yes   50% (20%)        706 ms (226 ms)                 17% (5%)
3          100 ms   no    73% (16%)        722 ms (238 ms)                 17% (8%)
4          100 ms   yes   38% (18%)        848 ms (200 ms)                 19% (6%)

Table 2. Results of three-letter word masking. Numbers between brackets are standard deviations.
Condition  S.O.A.   Mask  Average correct  Averaged mean response latency  Averaged individual standard deviation
                          score            of the correct responses        of the correct response latencies
1          140 ms   no    76% (15%)        646 ms (112 ms)                 16% (4%)
2          140 ms   yes   70% (14%)        649 ms (125 ms)                 15% (3%)
3          100 ms   no    80% (13%)        648 ms (114 ms)                 14% (2%)
4          100 ms   yes   56% (22%)        702 ms (145 ms)                 18% (4%)

Table 3. Results of nonsense string masking. Numbers between brackets are standard deviations.
Results of string-mask experiment
In this experiment 12 trained subjects participated, their results being summarised
in Table 3. When masking with a long sentence consisting of nonwords, a S.O.A. of 140 ms
again seems to have little influence on the perception of the stimulus. An earlier
presentation of the mask (S.O.A. 100 ms) decreases the percentage of correct score
by 22% and increases the response latencies of the correct responses by 8% (52 ms
delay of the correct answers).
The results of the masked/nonmasked comparison at a S.O.A. of 100 ms are presented
in Fig. 2 as an example of the experimental results.
[Fig. 2: bar charts per subject (1-12) at S.O.A. = 100 ms: decrease in percent correct (%C) and increase in response latency (RT).]
Fig. 2. There is a great variability between the results of subjects for both the decrease in percent correct score (%C) and the increase in response latency (RT).
Discussion
Influence of stimulus onset asynchrony
From these experiments it can be concluded that a mask presented at a S.O.A. of
140 ms after the stimulus does not really affect the perception of the stimulus; at
a S.O.A. of 100 ms there is marked degrading of the perception of the stimulus.
This holds good for both a three-letter word mask and a long-sentence mask consisting
of nonsense words.
Differences between the two experiments
The differences between the two experiments were:
a) the amount of training of the subjects;
b) the length of, and amount of information in the mask.
The trained subjects of the string mask experiment show a higher correct score in
the no-mask condition (78% versus 65%).
Differences in mask/no-mask condition were roughly equal in the experiments (27%
and 22% difference in correct score, 16% and 8% in response time for a S.O.A. of
100 ms), but the results of the string mask were slightly "better". Perhaps the
subjects had difficulty in distinguishing between the poorly visible stimulus word
and the readily visible mask word.
Relevance to reading
In reading, only 9% of the fixation durations are below 140 ms, and on the basis
of the present results the distribution starts at a level where the influence of
masking in our experiments has already ended. However, in reading, the retinal
image keeps on changing, whereas in our experiment we had only one stimulus and one mask.
Summary
In a pattern backward masking paradigm a word of three letters is presented to sub-
jects, followed some time later by a mask consisting of either a three-letter word
at the same retinal place, or an overlapping, long string of nonsense words. In both
cases there is not much masking at a S.O.A. of 140 ms, but a S.O.A. of 100 ms de-
creases the correct-score percentage by 25% and increases the correct response la-
tencies by 8% to 16%.
As far as can be judged from these preliminary data, masking effects are already
terminated at the point where the distribution of fixation durations starts.
References
Andriessen, J.J. and De Voogd, A.H. (1973) Analysis of the eye-movement patterns in silent reading, I.P.O. Annual Progress Report, 8, p. 29-34.
Averbach, E. and Sperling, G. (1961) Short-term storage of information in vision, In: 4th London Symposium on Information Theory, Cherry, C. (Ed.), Butterworths, London.
Zamansky, H.S., Scharf, B. and Brightbill, R.F. (1971) Backward and forward masking as a function of number of letters, interstimulus interval and luminance, Journ. of Exp. Psychol., 2, p. 235-241.
Letter cancellation in words and nonwords
H. Timmers
Introduction
Corcoran (1966) introduced a technique for the study of the reading process that re
quired subjects to search for a target letter, namely the letter e, in a coherent
piece of prose. One of Corcoran's interesting findings was the higher omission rate
for silent e's than for pronounced e's, although the actual pronunciation had no
effect. To explain this result Corcoran suggested that the acoustic image of the
words plays an important role in visual scanning of text. Recently, this acoustic
factor explanation has been scrutinized by some researchers (Schindler and Jacobs,
1976; Smith and Groat, 1977). Schindler and Jacobs suggested that the importance of
the word containing the target letter to the meaning of the sentence might be corre
lated with the probability of missing that target letter. Their data didn't support
this 'importance' hypothesis, but they did find a significant correlation between the
relative frequency of a word in printed text and the number of subjects who missed
the target letter in the word concerned. This result is in agreement with a) the
study by Smith and Groat (1977) who also found more misses in high frequency words
and b) the finding of Corcoran (1966) that the e in the word,the had the greatest
p-robability of being undetected. The experiments by Smith and Groat didn't reveal an
acoustic effect, that is no more omission errors were made for silent e's than for
pronounced e's.
Interestingly enough, recent eye-movement data (O'Regan, 1977) show that "in exactly
the same preceding context where no syntactic predictions are possible, a low-fre
quency verb will attract more fixations than the article the" and "in context where
the word the is highly predictable from the preceding structure of a sentence, it will
rarely be fixated". These data show an interesting correspondence with the result
from the target letter studies suggesting that familiar words and familiar letter
clusters are readily recognised without the detailed processing required to identify
the target letter as a separate unit. On the basis of these studies it can be derived
that the predictability of a word in a sentence and of letters in a word may be an
important determinant of visual scanning of reading material. Consequently, we were
interested to investigate the effects of 1) words versus nonwords in a letter search
task; 2) the position of the target in a word and nonword and 3) the reading level
of the subjects in this task.
Experiment I
Methods
Reading material of two different types was used:
a. a page of 105 words; there were seven words to a line and 15 lines to the page.
Words were separated by a blank space. The words consisted of eight letters. The
letter e had to be marked; there were 32 target letters to the page quite homo
geneously distributed over the letter positions. The number of targets per line
was varied from zero to four. At the beginning of each line the letter e was
printed which had to be underlined in order to monitor the subjects for the correct line.
b. a page of 105 nonwords consisting of scrambled letters. Again there were eight letters
to a nonword, seven nonwords to a line and 15 lines to the page. The distribution
of targets over the letter positions was a little less homogeneous than on the
page of words. Each subject received the two pages, separated by at least one day.
The subject was given the instruction to mark all the e's he saw on each line.
Thirty-six elementary school children served as subjects, six from each grade
(1-6). The age of the children varied from 6 to 12 years. Across subjects from
the same grade, order of presentation of the two conditions (words vs. nonwords)
was counterbalanced. Reading level of all subjects was individually assessed by
means of the Tanghe reading test. The times taken to perform the task were measured.
Results
Fig. 1 shows the error scores averaged over each grade level for words and nonwords.
Subjects seem to miss more target letters in words than in nonwords (Sign test,
p < .01). This difference is the most substantial for the fourth and sixth grade. For
the first, second and fifth grade this effect is less pronounced; for the third it is
even completely absent. From the fourth grade onwards the absolute level of errors
clearly starts decreasing in the word condition; in the nonword condition this trend
already starts in the second grade.

[Fig. 1: histograms of the percentage of errors (0-32%) per grade (1-6) for words and nonwords.]
Fig. 1. The percentage of errors averaged over each grade for words and nonwords.
In Fig. 2 the percentages of misses are representedfor both conditions averaged over all grades as
a function of the position of the target letter
in the word/nonword. For words the omission rate
is lowest at the beginning (first position) and
at the middle (fifth and sixth positions); for
nonwords such a trend isn't observed and here
the errors are more randomly distributed over the target positions. Only first-grade
subjects don't show this particular effect. We were surprised by the huge error score
at the end of nonwords, but closer inspection of the data showed that this can be
considered as an artefact. On the test page one of the two nonwords, in which the
target is at the eighth position, is followed by a nonword in which a target letter
is present at the first position. Twenty-two of the 36 subjects missed this target
at the eighth position, presumably because their attention was already attracted by
the next letter e. If we correct the scores for this effect, the percentage of misses
is reduced from 31.9 to 5.0. The time subjects spent in performing the test decreases
from 5 min. for the first grade to 2.75 min. for the sixth grade. There is no sys
tematic difference between the times spent on the page of words and that spent on the
page of nonwords; with the exception of the first-grade subjects, who spent nearly 5 min. on nonwords and nearly 4 min. on words.
[Fig. 2: histograms of the percentage of misses (0-32%) for target positions 1-8 in words and nonwords.]
Fig. 2. The percentage of misses averaged over all grades for each of the eight target positions in words and nonwords. The double-hatched bar at the eighth position in nonwords represents the uncorrected score.
Experiment II
In a subsequent experiment the same test
was repeated with a greater number of
subjects in the beginning of the first
class of the elementary school. After
the previous experiment which had a
more exploratory character, we wanted
to ascertain more accurately whether
subjects just starting to learn to read
already process words and nonwords dif
ferently. Another point of interest was
the position effect which was clearly
revealed by the other grades but not
found in the first grade.
Methods
The same reading material and the same procedure were used as in experiment I.
Two groups of first-grade subjects, 21 and 23 respectively, took part in the experiment.
Results
[Fig. 3: histograms of the percentage of errors (0-32%) for target positions 1-8 in words and nonwords, first-grade subjects.]
Fig. 3. The percentage of errors averaged over 44 first-grade subjects for each target position in words and nonwords.
Fig. 3 represents histograms of error
scores as a function of target position
for words and nonwords averaged over the
two groups of subjects. Clearly the results
do not show a systematic position effect
for words. The data suggest that for nonwords the variation of the error scores
due to the target position is less than
for words, but this isn't a statistically
reliable effect. No different error scores
averaged over target positions are ob
tained for words compared to nonwords
(Sign test, p = .37). A strong practice
effect, however, was shown by most of
these subjects: the second session
yielded a substantially lower error rate
than the first (Sign test, p <.01).
Discussion
Our findings may be summarised as follows:
1. more target letters are missed in words than in nonwords by subjects from the second grade on;
2. a target-position effect is obtained in words but not in nonwords; only the first grade doesn't show this effect.
Subjects do not tend to spend more time on nonwords than on words so that the higher
omission rate for words can't be explained by a speed-accuracy trade off. Even if
the subjects are told to check their marks it will hardly change their error pattern. This simply suggests that in those cases people do not see the target letter when
reading words. Processing nonwords, however, requires the reader to read out the
letter strings that are presented.
The position effect we find in the word data suggests a characteristic of the eye
movements. It seems as if the eight-letter words require two fixations by the rea
der: one at the beginning and one at the middle of the word. This seems to imply that Corcoran's finding, that the later the e appeared in a word the more likely
it was to remain uncancelled, will depend on word length.
The fact that the first-grade subjects don't show the word/nonword effect and the
target-position effect, means that this letter search task is sensitive to the
change of reading ability and reading processes that occur on the way from first
to second grade.
Subjects from both the first and second grade weren't able to recognise the eight
letter words that were presented. So it may be concluded that the second-grade
children did process the words on the basis of familiar letter-cluster units.
Further study will be necessary to reveal more detailed aspects of the relation
between reading ability and identification of single letters.
Summary
When young subjects (children from 6 to 12 years of age) search for a target letter
in eight-letter words and eight-letter nonwords, they make more omission errors in
the word condition. As far as words are concerned, their error patterns show a
position effect, that is, fewer errors are made at the beginning and the middle of
the words. First-grade children do not show these effects.
References
Corcoran, D.W.J. (1966) An acoustic factor in letter cancellation, Nature, 210, p. 658.
O'Regan, K. (1977) Moment to moment control of eye saccades as a function of textual parameters in reading, Paper presented at the "Processing of Visible Language" conference, Eindhoven.
Schindler, R.M. and Jacobs, P.I. (1976) What do we see when we read? Paper presented at the Annual Meeting of the Eastern Psychological Association, New York City.
Smith, P.T. and Groat, A. (1977) Spelling patterns, letter cancellation and the processing of text, Paper presented at the "Processing of Visible Language" conference, Eindhoven.
Processing of visible language: a Symposium
H. Bouma, D.G. Bouwhuis and H. Timmers
An international symposium on "Processing of Visible Language" was held by IPO in
Eindhoven from 5 - 8 September 1977. The idea for such a meeting had its origin in
contacts between Paul A. Kolers (Toronto) and IPO, and was further worked out by an
organizing committee, including Anthony Cohen (Utrecht), Merald E. Wrolstad (Cleve
land, Ohio), and the authors of the present contribution. There were about 50 active
participants from North America and Western Europe, and 36 papers were presented at
six sessions. Prof.Dr. H.B.G. Casimir addressed the audience in the opening session.
The primary aim of the symposium was to establish contacts between three separate
groups who have a professional interest in the perception and production of visible
languages: a) investigators of human reading, b) graphic designers, c) display
engineers. In our times, when new types of reading develop under the influence of
new technologies, the time seemed fit for a concerted reflection on current and
future ways of presenting visible information.
Consequently, sessions were concerned with three themes: a) research on reading
processes b) production of graphic language c) technological display systems.
Insight into human reading has been notably advanced by recent research on pro
cesses that are supposedly involved in the reading of text. Such research has tra
ditionally concerned normal print on paper and, to the extent that electronic text
displays have more recently been used, this has been for reasons of experimental
convenience only. It seemed wise to use this accumulated body of work as a founda
tion for the symposium, which has a wider interest. There were sessions on "the
Control of Eye Movements in Reading", "Letter and Word Recognition", "Sentence and
Text Recognition" and "Reading and Listening". The last-mentioned aimed at eluci
dating the possible role of an auditory (speech) component in the processing of
text in contrast to other visible languages such as in mathematical formulae and
cartography. It was decided to leave out many related topics, notable ones
being the broad areas of psycholinguistics and of reading education.
A session on graphic languages dealt with the plurality of forms and formats in the
art of presenting visible language. Graphic languages have deep historic roots and
in the course of their development there have been gradual adaptations to the
technologies involved such as carving, writing, printing, photographing, and to the
skills of the readership. The scientific approach to perceptual processes in reading
is of course a much younger one. A combination of arts and sciences in the domain
of producing visible language would offer prospects for more explicit and rapid
adaptation, yet leave ample room for creative expression.
A session on technological media concentrated on options (such as interactive use)
and drawbacks (such as limited text quality) of various electronic displays, which
are rapidly supplementing print on paper for both the professional and the general
reader. Most activities in this field are very recent and are geared primarily to
convenient technology and minimum cost and less to optimum communication.
The notion that the areas of 'reading research', 'graphics', and 'displays' were
separate, was borne out by the symposium. Reading processes are so fascinating to
study, that there is a trend towards sub-areas for specialists with specific methods
and paradigms, in which there is a danger of losing a clear-cut relation to normal
reading. There appeared to be a need to re-establish such a relationship and, for
the understanding of actual reading as it occurs in daily life, to let relevance
be a dominant factor in the evaluation of results and in the definition of problems
for research. How, otherwise, could such research be digested by the interested
graphic designers and display engineers? If the researcher is perhaps in a good
strategic position, the graphic designer is optimally suited to find creative
solutions for graphic communication. The display engineer is gradually developing
an awareness of the problems in defining the human interface, but, so far, seems
to make too strong demands on the adaptability of the professional users in the
absence of a suitable forum which could define the general user as well.
The cleft between the different approaches became evident too in many discussions
where designers lost track of the relevance of highly debated issues on the psycho
logy of reading or where psychologists were at a loss to provide answers to such
seemingly simple practical questions as desired text quality or lay-out. Discussions
as well as personal contacts may, we hope, inspire creative thinking on the relation
ship between practical questions and tangible research problems.
Considering the widely different backgrounds and aims of the participants the results
of the symposium can only be evaluated in the long term. However, it has already
become clear that there are pressing problems in the presentation of text and other
visual information and that the research on reading processes has great potential for
the definition of basic issues, the provision of methods and even answers. For one
thing, the actual reading that people engage in is the real issue and research should
reflect any shifts in reading habits. So far, the 'practical' producers of visible
language are left with little choice other than to work largely on implicit guidance.
The years ahead should show whether explicit guidance will be added to that and a
possible symposium held in a few years' time may provide an interim answer.
The proceedings of the symposium will appear in 1978 under the editorship of
Paul A. Kolers, Merald E. Wrolstad and Herman Bouma.
Papers presented at the Symposium:
Reading processes
The control of eye movements in reading
Ariane Levy-Schoen and Kevin O'Regan (Paris)
The control of eye movements in reading (tutorial)
Keith Rayner (Rochester, N.Y.)
Eye movements in reading: eye guidance and visual integration.
George McConkie (Ithaca, N.Y.)
The role and control of eye movements in reading.
Kevin O'Regan (Paris)
Moment-to-moment control of eye saccades as a function of textual parameters in reading.
Dennis Fisher (Aberdeen, Maryland)
Understanding the reading process through the use of transformed typography: PSF, CSG and automaticity.
Wayne Shebilske (Charlottesville, Virginia)
Reading eye movements, macro-structure and goal-processing.
Letter and word recognition
Alan Allport (Reading)
Word recognition in reading (tutorial)
Alexander Pollatsek and Thomas Carr (Amherst, Mass.)
Wholistic and rule-governed encoding processes in word perception.
Richard Venezky (Newark; Delaware)
Orthographic regularities in English words.
John Marshall (Nijmegen)
Implications of the acquired dyslexias for the study of normal reading.
Philip Smith and Anne Groat (Stirling)
Spelling patterns, letter cancellation and the processing of text.
Don Bouwhuis (Eindhoven)
Letter recognition and word knowledge as determinants of word recognition.
Lester Lefton (Columbia, Carolina)
Peripheral information processing in reading.
Sentence and text recognition
Willem Levelt (Nijmegen)
A review of sentence perception research (tutorial)
Alan Baddeley (Cambridge)
Working memory and reading.
Mogens Jansen (Copenhagen)
Relations between the qualifications of different groups of readers and different aspects of given text.
John Merritt (Milton Keynes)
Contexts, concepts and reading outcomes.
L.J. Chapman (Milton Keynes)
The perception of language cohesion during fluent reading.
Patricia Wright (Cambridge)
When two no's nearly make a yes: a study of conditional imperatives.
Anthony Pugh (Leeds)
Styles and strategies in adult silent reading.
Reading and listening
Dominic Massaro (Madison, Wisconsin)
Reading and listening (tutorial).
Uta Frith (London)
Reading by eye and writing by ear.
Shulamit Reich (London)
How to make reading more like listening: a study of the Stroop effect in sentences.
Lee Brooks (Hamilton, Ontario)
A comparison of implicit and explicit alphabets.
John Morton (Cambridge)
Some experiments on facilitation in word and picture recognition and their relevance for the evolution of a theoretical position.
Graphic languages
Michael Twyman (Reading)
A schema for the study of graphic language (tutorial).
Wim Crouwel (Delft)
Typography, a technique for making a text 'legible'.
Anthony Marcel (Cambridge)
Paragraphs or pictographs: the use of non-verbal instructions for equipment.
Robert Waller (Milton Keynes)
Typographic access structures for educational texts.
Jeremy Foster (Manchester)
The use of visual cues in text.
Richard Phillips (London)
Making maps easy to read - a summary of research.
Technological media
Robert Rosenthal (Holmdel, N.J.)
The design of technological displays (tutorial).
Neil Wiseman (Cambridge)
Non-serial language
Willem Hoekstra (Geldrop)
Electronic paperwork processing in the office of the future.
Arthur Eger (Delft)
Computer-aided design of graphic symbols and an alphabet.
Richard Jackson (Salfords)
Television text: first experiences with a new medium.
The visibility song
Bouwhuis/Hogue
Sometimes I feel so lonely, sometimes I feel so blue.
Sometimes my heart's near breaking, my eyes have lost their view.
Now I woke up this morning, I hate to face the day.
But when I read a book that's when my sorrow goes away.
I process language, visible language, that graphic language,
till the still small voice comes through.
Cognition
Towards an analysis of dialogue organization principles
H.C. Bunt
Introduction
This paper describes some of our recent research on fundamental aspects of dialogues,
focussing on the establishment of a framework for analysing the organizational principles in goal-directed dialogues. The aim of this research is twofold: on the one hand to provide insight, useful in the design of man-machine dialogues; on the other
we hope to obtain more insight into human language processing by studying the
communicative effects of utterances in a dialogue.
Our research builds upon work in two different areas: Artificial Intelligence (A.I.) and linguistics.
A.I. studies of dialogues that should be mentioned here include those of Grosz
(Grosz, 1977; Walker et al., 1976), Mann et al. (Mann, 1977), and Winograd et al.
(Winograd, 1977, Bobrow et al., 1977).
Grosz has studied 'task-oriented dialogs' in which a computer guides a person in
performing a certain task. Particular attention was paid to the establishment of
mechanisms for resolving definite noun phrases and interpreting elliptic expressions.
In the studies carried out by Mann et al. and by Winograd et al., dialogues on
limited subject domains are viewed as instances of discourse patterns specific to
the subject domain in question. There would be one pattern for travel-information
dialogues, another for client-waiter dialogues in a restaurant, etc. The machine
interpretation of an utterance then comes down to fitting the utterance into such
a pattern. Patterns of this kind are called 'schemas' (Winograd) or 'dialogue games'
(Mann), and are similar in character to data structures used in other A.I. work,
known as 'scripts' (Schank, 1977) or 'frames' (Minsky, 1975).
A limitation of this approach is that it tends to focus on domain-specific aspects
of dialogues, paying too little attention to important general principles of linguistic communication.
The linguistic study of dialogues belongs to that part of linguistics called pragmatics, the theory of language use (see e.g. Groenendijk and Stokhof, 1977; Haberland and Mey, 1977).
Grice (1975) has formulated some very general principles for cooperative linguistic behaviour. From the operation of these principles Grice deduces the important notion of conversational implicature. Conversational implicatures are presuppositions which a hearer must make in order to interpret a speaker's behaviour as being in accordance with the general principles. For instance, if one Dutchman asks another: "Where does Jan live?", and the answer is "Jan lives in Europe", it is conversationally implied that the answerer does not know in what country Jan lives (else he would violate the principle that one should be maximally informative). Grice's conversational principles are very general and vague, and need to be elaborated in order to be of practical use.
Utterances in dialogues very often contain expressions that refer explicitly to the
(linguistic or extralinguistic) context. Examples are personal pronouns (I, you, it), demonstratives (that), time and place indications (now, before, here), etc. Important
work on the interpretation of such expressions has been done by Hausser (1976, 1977).
He has developed a treatment in terms of an extended Montague grammar, in which
sentences from a discourse are interpreted with respect to a formal context model. The
most important extension of this framework, compared to standard Montague grammar,
is that such a context model involves a so-called stage description which, roughly
speaking, is a sequence of utterances representing previous discourse with indications
of their semantic interpretation.
The best developed approach to the study of language as a functional system for
communication, is speech act theory. Speech act theory approaches the use of language
as a case of rule-governed, intentional performance of actions: 'speech acts', and
it deals explicitly with such things as speaker- and hearer-intentions involved in
the performance of such acts (see also Bunt, 1976, p.98). One of the main points of
interest is the study of how utterances function in discourse: whether they assert,
question, promise, threaten, warn, etc. This functional aspect of an utterance is
called its illocutionary force. The 'content' of an utterance: that which is asserted,
asked, promised etc., is called its propositional content (Searle, 1969).
Speech act theory is still at an informal stage of development, and there is as yet
no generally accepted system of illocutionary forces. Searle (1969) analyses a
number of illocutionary forces in terms of the intentions and other aspects of the
mental state of speaker and hearer; Allwood (1976) gives a more refined analysis of
the factors that influence linguistic communication.
A framework for dialogue analysis
Linguistic and A.I. studies of dialogues have so far mainly resulted in a number of
interesting data and observations. One of our aims is to establish a general frame
work in terms of which the data and observations can be interpreted and explained
and which can lead to a more systematic exploration of dialogues.
Since there is a great diversity in kinds of dialogues, varying from apparently chaotic
brainstorm dialogues to extremely rigid superior-subordinate instruction dialogues,
it is desirable to choose a certain dialogue genre on which to concentrate. We have
picked out informative dialogues, by which we mean dialogues for the purpose of exchanging factual information about a well-defined subject domain, while at least
one of the participants knows exactly which information is to be exchanged.
In the present discussion we restrict ourselves in particular to informative dialogues in which someone (A) wants to obtain certain information and thinks that someone else (B) might possess this information. Assuming that B is willing to supply the information, if he has it, A initiates a dialogue with B.
To understand what happens in such a dialogue, we should first of all recognize that
A's saying something to B is intended to have an effect on B's state of knowledge,
as the result of B's understanding of what A said. Subsequently, B will react on the
basis of his understanding. Such a reaction may be an observable reaction, such as an answer or an acknowledgement, or a non-observable reaction such as believing what A said or correcting a wrong presupposition about A. What reaction is solicited will
depend on B's state of knowledge and the changes brought about in it by A's utterance.
We must, therefore, consider states of knowledge and the effects of utterances on them.
The total state of knowledge of an intelligent language user is so complex, that
it seems entirely hopeless to characterize it exhaustively. Fortunately, this is not
necessary: only certain properties of knowledge states need to be specified. To
establish the most relevant kinds of properties, let us take a closer look at various
aspects of a knowledge state, and how these aspects are involved in informative dialogues. First, it can be observed that not all the knowledge in a knowledge state is
affected by utterances in such a dialogue. An informative dialogue is about a certain
subject domain; knowledge that is unrelated to this subject domain and to the communication situation will be unaffected. For instance, knowledge of the language and how to use it, and general world knowledge, indispensable for being a competent language user, will remain unaffected. Other types of knowledge, such as knowledge of the other participant's aims in participating in the dialogue, do change in the course of the conversation.
Our approach is now as follows. We first try to separate those types of knowledge that
change during informative dialogues from those that remain constant. We make some
general assumptions about the latter, and then study the way in which various types
of the first kind operate dynamically in informative dialogues.
Dynamic factors
In the above characterisation of the kind of informative dialogues we consider here,
we already mentioned three types of knowledge that play a crucial role: the goal of
the dialogue-initiator A, his knowledge of contingent facts in the domain of discourse,
and his presuppositions about B's knowledge of such facts. We label these types of knowledge
as A-GOAL, A-FACTS, and A.B-FACTS, respectively. In order to open the dialogue, A
must also have developed, at least partially, a plan of action for achieving his
goal. We designate A's plan by A-PLAN. We consider the nature of plans below.
A-GOAL, A-FACTS, A.B-FACTS and A-PLAN are factors that we must assume to be among
A's knowledge at the beginning of the dialogue. As soon as the dialogue develops, the
other participant B builds up the same types of knowledge. We have here for each
participant four dynamic knowledge factors of obvious importance. We will call the
dynamic part of the total knowledge state of a dialogue participant A his K-state, designated by K_A, and the various factors in a K-state we call K-factors.*
A closer look at dialogues reveals that, besides the four K-factors mentioned above,
there are several others. For two dialogue participants to act cooperatively, they
* The terminology used here is perhaps not very satisfying. It is somewhat odd to call one's goals and plans 'knowledge'; in Bunt (1978) the term conversational information is used. Our use of the term 'presupposition' is also disputable. We use this term to refer to one's assumptions about goals, plans etc. of the dialogue partner; the reason for calling these assumptions presuppositions is that they typically must be made in order to interpret what is being said as making sense. The related issue of the degree of certainty of various types of knowledge is not discussed here.
must have some knowledge of each other's goals and plans. So we must also include in K_A: A's presuppositions about B's goal, designated by A.B-GOAL, and A's presuppositions about B's plan, designated by A.B-PLAN. Similarly for K_B.
This brings us to six K-factors for each participant: GOAL, PLAN and FACTS, and presuppositions about these factors in the other participant's K-state.
There is one more type of K-factor that we want to introduce here, the relevance of which is illustrated by the following dialogue sample:
(1) A: Do you know what the capital of the Netherlands is?
(2) B: Yes, Amsterdam.
(3) A: That's right.
(4a) B: Why do you ask?
(4b) B: I don't believe you thought I didn't know that!
In sentence (4b) we can identify a part (I) that refers explicitly to B's factual knowledge (B-FACTS): "I didn't know that"; a part (II) that refers to A's presuppositions about B-FACTS (i.e. A.B-FACTS): "you thought I didn't know that"; and the complete sentence refers to B's presuppositions about A.B-FACTS, i.e. to presuppositions about presuppositions.
In general, we must assume the existence of presuppositions about presuppositions of
the other participant. For instance, in order to correct a presupposition about
oneself, which is attributed to someone else, one must first make the presupposition
that the other participant is making the wrong presupposition. We therefore add to our six K-factors the presuppositions of one more level of indirection, and designate these (for A's K-state) by A.B.A-GOAL, A.B.A-FACTS, and A.B.A-PLAN.
In principle, one could go on indefinitely adding presuppositions of higher levels
of indirection, but from an empirical point of view it appears that we can stop here,
given our restriction to informative dialogues. Utterances like "I think that you
think that I think that you don't know this", which would correspond to one level
higher than that in A.B.A-FACTS, are almost unintelligible, and do not occur in informative dialogues.
We therefore conclude, tentatively, that a K-state can be characterized by the nine
factors distinguished so far.
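For concreteness, a minimal sketch of how such a K-state might be represented as a data structure; the Python names below are our own illustrative choices, not part of any implemented system. All K-factors except the plans are modelled as sets of propositions; plans are treated as simple lists of actions here and are discussed further below.

    # Sketch (hypothetical names): a K-state as nine K-factors, from A's side.
    from dataclasses import dataclass, field

    Proposition = str  # e.g. "Jan lives in Europe"

    @dataclass
    class KState:
        goal: set = field(default_factory=set)       # A-GOAL
        facts: set = field(default_factory=set)      # A-FACTS
        plan: list = field(default_factory=list)     # A-PLAN
        p_goal: set = field(default_factory=set)     # A.B-GOAL
        p_facts: set = field(default_factory=set)    # A.B-FACTS
        p_plan: list = field(default_factory=list)   # A.B-PLAN
        pp_goal: set = field(default_factory=set)    # A.B.A-GOAL
        pp_facts: set = field(default_factory=set)   # A.B.A-FACTS
        pp_plan: list = field(default_factory=list)  # A.B.A-PLAN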
This is not to say, of course, that these K-factors represent all the knowledge
that influences the course of a dialogue. Other factors are for instance:
(1) knowledge of the language, and how it is conventionally used in dialogues;
(2) presuppositions about the other participant's knowledge of the language;
(3) familiarity with the domain of discourse;
(4) knowledge of the social status of the other participant.
We deal with these factors in the following way. First, we assume that they are constant throughout the dialogue. Second, we assume that
(1) both participants have full command of the language;
(2) both participants presuppose that the other has full command of the language;
(3) the participants are quite (and equally) familiar with the domain of discourse;
(4) the social status of the participants (is such that it) does not give rise to restrictions on the communication.
For all other kinds of knowledge (such as: "He always exaggerates", "He's a big liar") we assume that similar conditions are met (such as: "He's not intentionally misleading me"), such that they have no distorting effect on the communication. A dialogue occurring under these conditions will be said to occur under ideal conditions,
or simply to be an ideal dialogue.
The notion of an ideal dialogue is of course an abstraction; hopefully a fruitful
one, that allows us to concentrate on the most essential factors involved in informative dialogues.
Summarizing our discussion of knowledge factors:
The basic assumption is that language is used functionally in dialogues in order to
bring about certain changes in the other participant's state of knowledge. A number
of knowledge factors have been identified that may be changed by utterances in informative dialogues. These factors, called K-factors, make up the dynamic part of a
state of knowledge, called K-state. Other knowledge factors that may influence the
course of a dialogue are assumed to have 'neutral' values and to remain constant
throughout the dialogue; a dialogue to which this assumption applies is called an
ideal dialogue.
The nature of plans
With the exception of a plan, all K-factors can be thought of as sets of propositions.
A-GOAL can be thought of as the set of propositions that A wants to be true at the
end of the dialogue; A-FACTS as the set of propositions representing contingent
facts in the domain of discourse, that A believes to be true; A.B-FACTS as the set
of propositions about B-FACTS that A believes to be true, etc. But a plan cannot
be thought of as a set of propositions. In the simplest case a plan is a sequence
of actions, leading to the desired goal. In general, a plan contains several alternative sequences of actions, the choice of an alternative being determined by the
outcome of actions earlier in the plan.
As an example of such a plan, consider the following part of a plan that is programmed into a computer system that supplies information about train departure
times on the basis of a simple dialogue, in which the system determines what information is desired:
1. Explain to the user which directions he can choose
2. Ask the user which direction he is interested in
If his answer cannot be recognized, go back to 1, but do this no more than three times.
If his answer is recognized as "R", then
3. Confirm: "direction R"
4. Ask the user whether he wants to travel today.
If the answer is yes, then
5. Ask whether he wants to know the next departure times.
If the answer is yes, then
6. Provide today's next departure times in the direction R.
If the answer is no, then
7. Ask what time he wants to leave approximately, etc.
Such a plan is conveniently represented by a labelled graph as follows:
Fig. 1. Graphical representation of a plan.
The expression 'max. 3' at the loops in this graph indicates that the loop should be taken at most 3 times. The plan represented here is in fact part of the plan used in the experimental dialogue system described elsewhere in this volume (Muller, Nooteboom, Willems, 1977). Note that every path in Fig. 1, leading from the top node to a terminal node, contains a sequence of actions leading to the desired goal. Each of these actions has a goal of its own, a subgoal relative to the overall goal, and the fulfilment of all subgoals along such a path is equivalent to the fulfilment of the overall goal.
The plans that people use in dialogues are usually not completely elaborated, like the plan represented in Fig. 1. Rather, these plans have the form: "I'll do this first; then, if he does such, I'll do so; if he does something else, I'll do something to make sure that so and so". Only part of such a plan is elaborated; another part does not specify the actions to be taken, but only a subgoal to be achieved. We call such a plan an incomplete plan.
An important property of the kind of plans we consider here is that they begin with an action. We will refer to the subgoal of this action as the immediate goal of the participant in question. We designate the immediate goal of participant A by A-ImGOAL.
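As an illustration, the plan fragment just described can be written down as a labelled graph in the style of Fig. 1. The encoding below is our own sketch, not the dialogue system's actual program: each node is an action with a subgoal of its own, edges are labelled by the outcome of the action, and the loop carries its 'max. 3' bound.

    # Hypothetical graph encoding of the plan fragment of Fig. 1.
    plan_graph = {
        "explain_directions":   {"done": "ask_direction"},
        "ask_direction":        {"recognized R": "confirm_direction",
                                 "not recognized": "explain_directions"},  # max. 3 times
        "confirm_direction":    {"done": "ask_travel_today"},
        "ask_travel_today":     {"yes": "ask_next_departures"},
        "ask_next_departures":  {"yes": "give_departure_times",
                                 "no": "ask_departure_time"},
        "give_departure_times": {},  # terminal node: overall goal fulfilled
        "ask_departure_time":   {},  # the plan continues beyond this fragment
    }

Every path from "explain_directions" to a terminal node is a sequence of actions whose subgoals jointly fulfil the overall goal; the subgoal of the first action is the system's immediate goal (ImGOAL) in the sense just defined.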
Turn-taking
We ultimately hope to describe how dialogue utterances function, in terms of their effects on K-states. We therefore have to indicate for each of the K-factors at which point in the dialogue we are considering them. We do this by indicating at what turn in the dialogue they occur. Turn-taking in natural dialogues is a complicated phenomenon (see Sacks et al., 1974), but in order to simplify matters we will assume that speakers do not speak simultaneously, etc.
In that case the following notation will do. Let A and B be the participants in a dialogue, and let A be the one who initiates the dialogue by a sequence u^(0) of utterances. Let K_A^(0) and K_B^(0) designate A's and B's K-states at the beginning of the dialogue.
As the result of A's utterances u^(0), B's K-state changes from K_B^(0) to a new state K_B^(1). Next it is B's turn; let u^(1) be the sequence of B's utterances. As the result of these utterances, A's K-state changes from K_A^(0) to a new state K_A^(1), etc., so that the following picture emerges.
Fig. 2. K-states at successive turns.
We will designate the K-factors in a K-state at turn i by the index i, so in K_A^(i) we have A-GOAL_i, A-FACTS_i, etc. This description of the dynamics of K-states is still too crude for our purposes, because we want to consider the changes in K-state that result from a single utterance within a turn. This can easily be done by adding intermediate K-states K^(i,j) between the states K^(i) and K^(i+1); however, in order to avoid the notational complexities that arise from this addition we will simply assume here that at each turn, the dialogue participants make only one utterance.
Question-Answer sequences
The conceptual framework so far developed is intended to serve as a basis for analyzing principles of dialogue organization, the point of issue being that changes in the state of knowledge of a listener can be described in terms of changes in K-factors which are systematically related to properties of dialogue utterances. The establishment of a system of rules capturing this relationship would be of particular interest in the design of man-machine communication systems. In this section we give an impression of what can be done in this direction on the basis of our conceptual framework, by considering some fundamental aspects of question-answer sequences in ideal informative dialogues.
In a dialogue, each participant interprets the utterances of the other participant
and generates new utterances. In the interpretation of utterances it is useful to
distinguish at least the following two aspects:
(i) the determination of what is being communicated;
(ii) the determination why this is being communicated.
For instance, the interpretation of the utterance "John bought a new car yesterday"
involves (i) the recognition that one is being told that John bought a new car
yesterday, (ii) the realization that the speaker presupposed that he was telling
something not already known to the listener.
In the generation of utterances we can distinguish at least the following aspects:
(i) the setting up of a new (sub)goal to be achieved next in the dialogue;
(ii) the formation of a plan for achieving that goal.
Let us now look at the interpretation of a question and the generation of an answer
in terms of changing K-factors. General observations on linguistic communication
that have been made by Grice, Searle and others (see the introductory section of
this paper), can often be given a more precise formulation in these terms (see
Bunt, 1978 for more details).
Let the dialogue participant A pose, at stage i in the dialogue, the question u^(i) to B, with the propositional content p. For instance, u^(i) = "Did John buy a new car?" with p = "John bought a new car." B's interpretation of u^(i) as a question means that he recognizes that A's immediate goal is to obtain the information described by p. In other words, B.A-ImGOAL_{i+1} will be that p belongs to A-FACTS_{i+1} (which was also A-ImGOAL_i; this corresponds to Searle's sincerity condition for questions).
B's realization as to why A posed the question u^(i) involves (under ideal conditions) the following changes in B's K-state:
- B presupposes that A didn't have the information p himself, i.e. "p does not belong to A-FACTS_i" belongs to B.A-FACTS_{i+1};
- B recognizes that A would only ask B the question u^(i) if A supposed that B might have the information p, i.e. '"p does not belong to B-FACTS_i" does not belong to A.B-FACTS_i' belongs to B.A.B-FACTS_{i+1};
- B recognizes that A would only ask B for the information p if A did not suppose that B was just about to supply p anyway, i.e. '"it is not the topmost action in B-PLAN_i to tell p" belongs to A.B-PLAN_i' belongs to B.A.B-PLAN_{i+1}.
The first and third of these points correspond to Searle's preparatory conditions for
questions; the second point appears to be something that Searle has overlooked.
On the basis of his interpretation of u^(i), B will generate a response. An important part of B's interpretation of u^(i) was the recognition of A's immediate goal. If correctly recognized, A's immediate goal is what B thinks it is, i.e.

    B.A-ImGOAL_{i+1} = A-ImGOAL_i

and this goal is that p belongs to A-FACTS_{i+1}. Under ideal conditions, B takes A's question seriously and attempts to behave cooperatively. Optimal cooperative behaviour will follow if B adopts A's presupposed immediate goal as his own goal, i.e.

    B-GOAL_{i+1} = B.A-ImGOAL_{i+1}

Subsequently, a plan can be formed to achieve this goal. The simplest possible plan consists of simply giving the information p to A, provided the information p is available in B-FACTS_{i+1}. This would be an utterance with the illocutionary force of an assertion; the condition of availability is one of Searle's preparatory conditions for making an assertion, as well as a special case of Grice's 'principle of quality'. The simple rule governing the generation of a response that seems to operate here is: if B-GOAL_{i+1} is that p belongs to A-FACTS_{i+1}, and p belongs to B-FACTS_{i+1}, then B-PLAN_{i+1} is the single action of telling p to A. Execution of this plan is answering A's question.
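The rule just stated is easily made concrete. The sketch below is our own illustration, reusing the hypothetical KState of the earlier sketch for B's side, so that 'facts' stands for B-FACTS, 'goal' for B-GOAL and 'plan' for B-PLAN; it is not the dialogue system's implementation.

    # Hypothetical sketch of the response-generation rule for a question
    # with propositional content p.
    def respond_to_question(b, p):
        b.goal.add(p)              # B-GOAL(i+1) := B.A-ImGOAL(i+1): A is to know p
        if p in b.facts:           # availability condition (Searle, Grice)
            b.plan = [("tell", p)] # B-PLAN(i+1): the single action of telling p
        # otherwise another plan must be formed, e.g. correcting a wrong
        # presupposition; this lies outside the simple rule sketched here

    b = KState(facts={"the capital of the Netherlands is Amsterdam"})
    respond_to_question(b, "the capital of the Netherlands is Amsterdam")
    assert b.plan == [("tell", "the capital of the Netherlands is Amsterdam")]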
Of course there are many other kinds of replies that B can sensibly give to A's question, depending on B's K-state. For instance, if B thinks that the question contains a wrong presupposition, he may be expected to correct this.
We hope that it will be possible to develop a system of rules within the present
framework comprehensive enough to be of interest as a model for human informative
dialogues and to guide the operation of a machine engaging in an informative dialogue
with a human user.
Summary
'Informative dialogues', i.e. dialogues with the aim of exchanging certain well-defined information, are studied for their organizational principles. Basic to this study is the assumption that language is used in a functional way in order to influence the other participant's state of knowledge. A number of knowledge-state factors have been identified that are subject to changes during an informative dialogue. These factors, which include a goal, a (possibly incomplete) plan, knowledge of the domain of discourse, and various types of presuppositions about the other participant, make up the variable part of a knowledge state, called K-state. Other knowledge factors that might influence the course of a dialogue are assumed to have no distorting effect on the communication and to remain unchanged throughout the dialogue; a dialogue to which this assumption applies is said to occur under 'ideal conditions'. The aim of this study is to develop a system of rules for the generation and interpretation of utterances in informative dialogues occurring under ideal conditions, by relating properties of utterances to properties of K-states. As an illustration, a simple rule has been sketched for giving the complete direct answer to a question.
References
Allwood, J. (1976) Linguistic communication as action and cooperation: a study in pragmatics. Gothenburg Monographs in Linguistics, 2.
Bobrow, D.G., Kaplan, R.M., Kay, M., Norman, D.A., Thompson, H. and Winograd, T. (1977) GUS, a frame-driven dialog system. Artificial Intelligence, 8, p. 155-173.
Bunt, H.C. (1976) Some recent developments in semantics. I.P.O. Annual Progress Report, 11, p. 94-107.
Bunt, H.C. (1978) Dialogue analysis and speech act theory. In: Papers from the IVth Scandinavian Conference of Linguistics, Middelfart (Denmark).
Clark, H.H. and Clark, E.V. (1977) Psychology and language. Harcourt Brace Jovanovich, New York.
Grice, H.P. (1975) Logic and conversation. In: P. Cole and J.L. Morgan (Eds.), Syntax and semantics, vol. 3, Speech acts. Academic Press, New York.
Grosz, B.J. (1977) The representation and use of focus in dialogue understanding. Ph.D. Thesis, Berkeley (California).
Groenendijk, J. and Stokhof, M. (1977) Semantics and pragmatics, a theory of meaning. Paper presented at the Conference on empirical and methodological foundations of semantic theories for natural languages, Nijmegen, March 14-18.
Haberland, H. and Mey, J.L. (1977) Linguistics and pragmatics. Journal of Pragmatics, 1, p. 1-11.
Hausser, R.R. and Zaefferer, D. (1976) Questions and answers in a context-dependent Montague grammar. In: Proceedings from the Bad Homburg workshop on formal semantics.
Hausser, R.R. (1977) The semantics of mood. Paper presented at the Symposium on Speech Acts and Pragmatics, Dobogókő (Hungary), September 5-8.
Mann, W.C., Moore, J.A. and Levin, J.A. (1977) A comprehension model for human dialogue. Proc. 5th Int. Joint Conference on Artificial Intelligence, Boston (Mass.), August 22-25.
Minsky, M.L. (1975) A framework for representing knowledge. In: P. Winston (Ed.), The psychology of computer vision. McGraw-Hill, New York.
Muller, H.F., Nooteboom, S.G. and Willems, L.F. (1977) An experimental system for man-machine communication by means of speech, this issue.
Sacks, H., Schegloff, E. and Jefferson, G. (1974) A simplest systematics for the organisation of turn-taking for conversation. Language, 50, p. 696-735.
Schank, R.C. and Abelson, R.P. (1977) Scripts, plans, goals and understanding. L. Erlbaum, Hillsdale (N.J.).
Searle, J.R. (1969) Speech acts. Cambridge University Press, Cambridge (U.K.).
Walker, D.E. (1976) Speech understanding research, final technical report. Stanford Research Institute, Menlo Park (California).
Winograd, T. (1977) A framework for understanding discourse. A.I. Memo, Stanford University.
Knowledge of Dutch three-letter words
Rectification
In the previous issue of the Annual Progress Report the following article contained
some serious printing errors:
Knowledge of Dutch three-letter words
J.C. Jacobs, A.L.M. van Rens and D.G. Bouwhuis
IPO Annual Progress Report, 1976, 11, p. 77-84.
On page 78 of this article table 1 gives the probabilities of letters appearing in
the three positions of Dutch three-letter words. This table contains four printing
errors and should be replaced by the one printed below.
1st letter  %       2nd letter  %       3rd letter  %
b           7.3     e           21.0    t           11.1
d           6.6     o           20.1    l           10.1
p           6.5     a           18.5    s            9.8
l           6.5     i           14.3    k            9.7
k           6.3     u           13.5    n            7.2
h           6.3     r            3.5    p            6.3
r           6.2     l            2.7    f            5.6
t           5.5     n            1.4    m            5.5
m           5.5     k            0.6    e            5.5
a           5.0     d            0.6    g            4.6
w           4.3     t            0.4    r            4.2
o           4.2     s            0.4    i            3.6
n           4.1     p            0.4    a            3.2
g           3.8     h            0.4    d            3.1
e           3.8     g            0.4    b            2.5
v           3.2     c            0.4    u            2.4
s           3.2     b            0.4    o            2.0
z           2.8     m            0.3    x            1.5
j           2.8     j            0.3    h            0.8
f           2.5     w            0.1    w            0.7
i           1.3     v            0.1    c            0.4
u           1.1     f            0.1    z            0.1
c           1.1     z            0.0    y            0.0
q           0.1     y            0.0    v            0.0
y           0.0     x            0.0    q            0.0
x           0.0     q            0.0    j            0.0

Table 1. Letter distributions of initial, middle and final letters of the 713 real Dutch words.
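As an aside (our own illustrative sketch, not part of the original article), positional letter probabilities of this kind can be combined to score candidate three-letter strings, under the simplifying assumption that the three positions are independent:

    # Hypothetical use of Table 1: score a three-letter string by the product
    # of its positional letter probabilities (excerpts of the table, in %).
    p1 = {"b": 7.3, "k": 6.3, "t": 5.5}     # 1st-letter column
    p2 = {"e": 21.0, "o": 20.1, "a": 18.5}  # 2nd-letter column
    p3 = {"t": 11.1, "k": 9.7, "n": 7.2}    # 3rd-letter column

    def score(word):
        a, b, c = word
        return p1.get(a, 0.0) * p2.get(b, 0.0) * p3.get(c, 0.0)

    print(score("bak"))  # 7.3 * 18.5 * 9.7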
Ergonomics and perceptual aids
Legibility of rectilinear digits
H. Bouma and F.L. van Nes
Introduction
Over the last decade, electronics technology has advanced very rapidly, leading to
great decreases in size and price of all sorts of information-processing equipment.
Consequent changes in controls and displays present opportunities as well as challenges to ergonomics. Controls, on a calculator or oscilloscope, for instance, tend
to get more numerous and smaller, and thus more difficult to handle. On displays,
the presentation of analog information is being superseded by representations using
letters and digits. These symbols in that case generally consist of a number of parts,
either dots, arranged in a matrix, or line segments. The same parts thus figure in
different symbols. The shape of the latter, and of the display device itself, are
mainly chosen on technical and economic grounds. It then remains to be seen how the
resulting text or numbers compare with what people are still most used to: letters
and digits which are not divided into parts, and printed in black on white paper.
As yet, little attention has been given to the question of how well people can read
text, numbers or other symbols generated on electronic displays. In our institute
some research on the legibility of matrix letters for CRT screens was done (Bouma
and Leopold, 1969; Bouma and van Rens, 1971), and the present paper will present
data on the legibility of the rectilinear, segmented digits now appearing on most
displays for numerical information. Such digits coexisted for some time with digits
containing curvilinear parts (Fig. 1), as successors to the conventionally shaped
numbers displayed by Nixie tubes. Nowadays, (almost) completely rectilinear digits,
like those in row d of Fig. 1, appear to have gained.
Fig. 1. Examples of digit shapes, built from curved and straight line segments. Row d shows the shapes investigated in the experiment reported.
When judging the legibility of symbols, 3 factors should be taken into account: visibility, discriminability and acceptability. For line-segment digits, this leads to the following considerations:
visibility, or identifiability, depends on contrast between the digit on display and its background. It can be obtained by selecting a suitable combination of line-segment dimensions and luminances, and by avoiding reflections on the display from other light sources;
discriminability, or individuality of different digits, is important, even more so than with letters, because there is generally no redundancy in numbers, as opposed to words. Line-segment digits may be prone to confusion owing to their similarity of shape; therefore this factor needs special attention;
the acceptability of symbols depends on their correspondence to readers' concepts, in this case, of digit shapes. A high acceptability, i.e. a reasonable degree of resemblance to those concepts, should be aimed at in displays used by the general public, like watches or calculators. Differences between the shapes of handwritten digits in different countries, for instance in the digits 1, 7 and 8, may reflect a problem in this area. Professional users of digit displays may be satisfied with somewhat more unusual digit shapes, as long as these possess a high individuality.
Our experiments were mainly concerned with discriminability. We chose perceptually difficult conditions in order to obtain a sufficient number of errors for analysis. The results of this analysis led to suggestions for improved digit shapes. The improvements aim in the first place at increasing digit discriminability, and in the second place at improving their acceptability. The present paper is a brief report of the experiments, which will be published elsewhere in more detail.
Method
Two observational conditions were used:
1. reading or recognizing luminous single digits, of the type shown in the lowest row of Fig. 1, at a distance of 16 m. The height of the digits, 19 mm, then corresponds to a visual angle of 4.1 minutes of arc. At this distance, about 60% of the presented digits was correctly recognised.
2. parafoveal reading at a normal distance, i.e. 57 cm, in the right visual field. Both isolated digits and three-digit numbers were presented, in two separate groups, at an excentricity which also led to an average recognition score of about 60%. This meant excentricities of 30 degrees for the single digits, and 5, 7½ and 10 degrees for the hundreds, tens and units, respectively, of the three-digit numbers. In these parafoveal conditions the stimuli were presented for 100 ms, to eliminate the possibility of foveal stimulus projection through eye movements.
Ten subjects participated in all parts of the experiment. Each digit was presented
10 times, in random order, in the separate parts.
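As a check on the distance condition (a back-of-envelope calculation of our own), the visual angle of a digit of height h = 19 mm seen at distance d = 16 m is

    angle = arctan(h/d) ≈ h/d = 0.019/16 ≈ 1.19 × 10^-3 rad ≈ 0.068° ≈ 4.1 minutes of arc,

in agreement with the value quoted under condition 1.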
Results
The average scores of correctly recognised digits by the subjects were:
- for distance reading: 64%
- for excentric reading of single digits: 65%
- for excentric reading of numbers: 81%, 37% and 53% for hundreds, tens and units, respectively.
Fig. 2 shows the recognition percentages for each digit, averaged over the 3 parts of the experiment. Considering averages is justified because the different observational conditions yield similar results. A comparison of Fig. 2 with the way in which the digits are made up leads to the assumption that a relation exists between a digit's correct score and the number of segments it counts. In Fig. 3, therefore, the percentage of correct responses is shown as a function of the number of line segments of the digits concerned. The point for 5 line segments represents the average correct score of the digits 2, 3 and 5, because these are all made up of 5 segments. Likewise, the point for six line segments represents the average correct score of 0, 6 and 9. The other points in Fig. 3 are each related to one digit. The figure demonstrates clearly that correct scores decrease with the number of segments which digits count.

Fig. 2 (left) and Fig. 3 (right). Percentages of correctly recognised presentations, shown at the left for each digit (averaged over all observational conditions) and at the right as a function of the number of line segments of the digits concerned. The points at five and six segments each represent averaged percentages from 3 digits (2, 3, 5 and 0, 6, 9, respectively).
Fig. 4. The percentage of confusions as a function of the total number of difference segments between the pairs of digits concerned, averaged over digit pairs and observational conditions. See text for an explanation of "difference segments".
Subjects are quite willing to make guesses about the identity of digits which they see more or less vaguely. The response "illegible", though explicitly allowed, was given for only 1% of all stimulus presentations. This means that the incorrect responses almost all consist of confusions between digits. It was found that considerable differences exist between the frequencies of all the possible confusions; in other words, they occur systematically. Analysis of the system underlying the confusions may yield information on the perceptual processes which occur during digit recognition. Fig. 4 shows that an inverse relation obtains between the percentage of confusions - averaged over all observational conditions and all digits - and the total number of "difference segments". This total is the sum of the number of line segments in which the digits concerned differ, by addition as well as omission. For example, the total number of difference segments for the digits 4 and 7 is three. The important conclusion that can be drawn from Fig. 4 is that the probability of confusion with another digit diminishes strongly as that digit differs by the greater number of segments from the digit actually presented.
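The notion of difference segments is easily made explicit. In the sketch below (our own illustration; the seven segments are labelled A-G in the conventional way, which need not coincide with the lettering of Fig. 5), each digit is a set of segments and the number of difference segments of a pair is the size of the symmetric difference of the two sets:

    # Hypothetical sketch: seven-segment encodings and difference segments.
    SEGMENTS = {
        0: set("ABCDEF"),  1: set("BC"),     2: set("ABDEG"),  3: set("ABCDG"),
        4: set("BCFG"),    5: set("ACDFG"),  6: set("ACDEFG"), 7: set("ABC"),
        8: set("ABCDEFG"), 9: set("ABCDFG"),
    }

    def difference_segments(d1, d2):
        # segments present in one digit but not in the other
        # (additions as well as omissions)
        return len(SEGMENTS[d1] ^ SEGMENTS[d2])

    print(difference_segments(4, 7))  # 3, as in the example above
    print(difference_segments(8, 9))  # 1: such pairs are confused most often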
Confusions between digits are generally not symmetrical. It turned out, for instance, that at a distance of 16 m the presented digit 8 was fourteen times read as "9", whereas the presented digit 9 was only read three times as "8". In this case, as well as in many others, the perceived digit more often contains fewer segments than the digit presented than the other way round: a kind of simplification tendency in perception.
Summarising, the results can be described with 3 general rules:
1. the smaller the number of segments from which a digit is built up, the better it is recognised;
2. the larger the total number of segments in which two digits differ, the less the
probability that they will be confused;
3. a digit which is not correctly recognised, is more often perceived as one with a
structure simpler than that of the presented digit than the other way round.
Discussion
Fig. 4 suggests that the number of difference segments plays an important role in recognising rectilinear digits. Consequently, it is interesting to investigate whether all segments are equally prominent in perception; i.e., whether they all carry the same perceptive weight. This will mainly depend on the distinctive function attributable to the separate segments. If, for instance, a common segment occurred in all digits - which is not so - its perception would be of hardly any importance. The distinctive role of the seven line segments has been determined for all digit pairs on theoretical grounds, i.e. as regards the contribution of individual line segments to the difference in shape between the members of a pair. Here we shall only consider those pairs which differ in one line segment. The analysis of pairs with 2 differing segments is analogous, and does not lead to different conclusions (Bouma and van Rens, 1971).
Fig. 5a depicts all digit pairs differing only in one segment, written inside the segments concerned. It turns out that segments C and F each have a distinctive function for 2 digit pairs, and that segments D and E have no distinctive function for pairs with just one difference segment.

Fig. 5a. Digit pairs which differ in only one segment, depicted in that segment. This shows that the segments C and F are important in distinguishing between different digits.
Fig. 5b. Number of confusions between the digit pairs from Fig. 5a, written in the corresponding segments. The numbers 38 and 54 each are the sums of confusions between the two digit pairs depicted in segments C and F, respectively.
Obviously these considerations are of limited value if they do not correlate with the perceptive significance of the separate segments. This significance should, for segments A, B and G, be higher than for D and E, but lower than for C and F. It is very simple to check if such relationships hold, by counting the number of confusions between the digit pairs concerned. The result of this operation is shown in Fig. 5b: in each segment the number of confusions is shown between the digit pair(s) from the corresponding segment of Fig. 5a. The agreement between these numbers of confusions and the related numbers of digit pairs (0, 1 or 2) is good. When this analysis is extended so as to include digit pairs differing in 2 segments, a correlation coefficient of +0.97 is found between the distinctive role of the segments and the number of confusions between corresponding digit pairs (Bouma and van Rens, 1971).
Consequences in the design of rectilinear digit shapes
Can we use the experimental results now to obtain a better design for line-segment digit shapes? As stated in the introduction, by "better" we mean in the first place a higher discriminability of digits. Since the number of confusions increases sharply with a decrease in difference segments δ, as demonstrated by Fig. 4, we seek to avoid low values of δ, especially δ=1. The upper half of Fig. 6 shows the digit shapes investigated, whereas in the lower half a number of alternatives are drawn, with the aim of increasing the distinction between digit pairs with δ=1. The following points may be considered:
1. changes in shape for /6/ and /9/. This would lead to δ=2 instead of δ=1, for 5 of the 7 pairs from Fig. 5a. One new pair with δ=1 would result, however, viz. /4/ and /9/. Still, the number of pairs with δ=1 would be 4 less, whereas the acceptability of the /6/ and /9/ would probably not suffer from the change.
2. a changed position of /1/ in the segment network, viz. from the right to the left vertical line segments. Especially in numbers with more than one digit, this would enlarge the distinction between /1/ and /7/, which pair would then formally get δ=5 instead of δ=1.
3. another way to increase δ for the digit pair /1/ and /7/ is to change the shape of the /7/ as drawn in Fig. 6. Indeed, δ=2 results for this pair; however, for the pair /7/ and /9/ δ decreases from 2 to 1. So no overall improvement is obtained. Also, it remains to be seen whether the lower /7/ in Fig. 6 is as acceptable as the higher one.
4. Similar objections can be raised to the changed digit /0/ in Fig. 6. The lower shape will hardly be found to be acceptable, and would create a new pair with δ=1: /0/ and (the new) /6/, which had δ=3. The motive for the change in shape of the /0/ would be an increase from δ=1 to δ=3 for the pair /0/ and /8/.
So far, we have only considered omissions and additions of line segments, retaining
their original shape. It is also possible to design improved digit shapes by accentuating their perceptually important line segments. Possibilities include making such segments a bit broader, say 50%, or, in the case of horizontal segments, lengthening them somewhat, e.g. 30%. It can be shown that the acceptability of such digit shapes has not decreased. An increase in acceptability may be obtained by increasing the slant of the digits to 20°. This has the added advantage of increasing the distinction between line-segment numbers, as a group, and capitals.
Fig. 6. Upper row: the digit shapes investigated; lower row: possible alternatives for 5 of the 10 digits. The alternatives originate from the experimental data.
In conclusion, we arrived at a new design for rectilinear digit shapes as shown in
Fig. 7. Only after an experiment with a display incorporating this design, under the
same observational conditions as in the present experiment, could the effect of the
proposed changes on discriminability and acceptability be evaluated. Unfortunately,
such an experiment was not possible because no displays based on the new design have
been constructed.
Fig. 7. Proposal for improved seven-segment digit shapes. The improvements relate in the first place to the discriminability of the digits, and only in the second place to their acceptability.
Summary
Technical and economic developments have led to a rapidly increasing use of new media
for text presentation. Little is known about the legibility of the letter- and number
shapes which are used for such electronic media. This paper deals with research on
the recognisability of numbers built up from straight line segments, which therefore
have a schematic form. Erroneous recognition of such numbers leads to confusion between them. The distinctive function of the individual line segments has been determined from the errors. This analysis leads to improved design of the number shapes. First, the improvements aim at increasing the discriminability of the numbers; second, improvement of their acceptability, i.e. resemblance to the traditional number shapes, plays a role.
Acknowledgement
The authors are greatly indebted to A.L.M. van Rens, for his important contribution in designing the experiment and elaborating the data from it.
References
Bouma, H. and Leopold, F.F. (1969) A set of matrix characters in a special 7 x 8 array. I.P.O. Annual Progress Report, 4, p. 115-119.
Bouma, H. and van Rens, A.L.M. (1970) Cijferherkenning bij een indicatorbuis met 7 lijnsegmenten. I.P.O. Rapport no. 179.
Bouma, H. and van Rens, A.L.M. (1971) Completion of an alphanumeric matrix display with lower-case letters. I.P.O. Annual Progress Report, 6, p. 91-94.
A typewriter for a motorically handicapped person, operated by head movements
P.H. van der Heijden*, H. Bouma, H.E.M. Melotte and F. Meyer**
Introduction
A typewriter has been made for a motorically handicapped person deprived of the
normal use of his arms and hands. It enables him to display his text on a screen
before it is typed out on a printer.
The handicapped person used the machine intensively and successfully for over a year
both to communicate with others and to keep himself creatively occupied.
Since it was to be expected from the beginning that the patient would retain for a
relatively long time the ability to move his head, the search for a solution was
concentrated on that possibility. The main idea was to fix a lamp to the patient's
brow, so that by movements of the head he would be able to direct a beam of light
onto a "keyboard" with photosensitive cells (see Fig. 1).
Fig. 1. The user operates the typewriter by directing the light beam from a lampfixed to his brow upon a "keyboard" with photocells. Other operating functions werelater added to this panel.
Members of Philips Research Laboratories*** and of the Institute for Perception
* user of the apparatus
** Philips Research Laboratories, Eindhoven
*** A.H.T. Sanders and H.A.J. Sanders provided the electronics
Research formed a team to cooperate on the implementation of this project. It turned
out that the idea was already known in the literature (Soede, Stassen, van Lunteren
and Luitse, 1973) and there were even a number of prototypes in existence in the Netherlands. However, since it was not possible to get hold of such a prototype in
the short term and postponement was not acceptable, the team decided, in order to
save time and cut out lengthy development work, to build a similar apparatus them
selves, based on a slightly modified design.
After a week of preparatory discussions, work on the construction of the apparatus was started on 23 September 1976, and on 22 October 1976 the equipment was installed
in the user's home. A commercially available electronic typewriter is used, com
prising a memory and display screen (Superbee), coupled to a printer. The signals
are given with a small lamp fixed to the user's brow, which directs a beam of visible
light on to a panel with photosensitive cells. Via an "interface" the "strokes" on
the cells are passed through to the typewriter and displayed on the screen. In this
arrangement there is full access to the letter, numeral and character facilities of
the Superbee display screen. The built-in memory provides ample possibilities for
altering and correcting the text before print-out.
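The selection principle of the photocell panel is a simple dwell-time rule: a cell only produces a "stroke" after the beam has rested on it for about half a second, as described in the evaluation that follows. The actual interface was built in hardware; the sketch below is a hypothetical software analogue, with parameter names of our own, intended only to make the rule explicit.

    # Hypothetical software analogue of the photocell panel's dwell-time rule.
    DWELL = 0.5   # required dwell time in seconds

    def keystrokes(samples, dt=0.05):
        # samples: cell id hit by the beam (or None), observed every dt seconds
        out, current, held = [], None, 0.0
        for cell in samples:
            if cell is not None and cell == current:
                held += dt
                if held >= DWELL:
                    out.append(cell)            # stroke passed to the typewriter
                    current, held = None, 0.0   # beam must leave the cell first
            else:
                current, held = cell, 0.0 if cell is None else dt
        return out

    # beam rests on 'e' for 0.6 s, wanders off, then rests on 'n' for 0.5 s
    print(keystrokes(["e"]*12 + [None]*3 + ["n"]*10))   # ['e', 'n']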
Evaluation
The user himself regularly wrote an evaluation (in Dutch), a selection from which is given in translation below.
The user has the apparatus set up in his study beside a pegboard for letters and
newspaper articles and a page-turning device, which he can also operate from his
wheelchair (see Fig. 2).
Fig. 2. The equipment set up in the user's study in accordance with his personal wishes.
24 October 1976
First general impressions
I have had the apparatus at home now for just over two days. The day before yesterday
I worked on it from 12 to 3 in the afternoon, and then from 7 to 10 o'clock in the
evening (first flush of enthusiasm!). Yesterday I had a break, and today I have again
worked on it from 4 to 6 and this evening from 7 to 11 o'clock. This proves that I
have so far had no trouble from fatigue. On the contrary, the activity has something
peaceful about it: you sit quietly and set down your thoughts at a calm pace. This
obviously has something to do with the satisfaction of being able, for the first
time in months, to put something down on paper without outside help, and also with
the fact that this is real "activity", as opposed to the other, passive pursuits such
as reading and watching television.
So far the head movements have not been noticeably tiring, and directing the light
beam on to the cells is no problem: it is possible to keep the beam in position for
the required half second and then to move on to the next letter. The pace is calm
and regular, the clocking of the relays reminding one of the ticking of a Friesian
clock.
After two days the number of wrong strokes has dropped to a very acceptable level
(one error in two lines I now regard as rather a poor score, due to carelessness
and/or tiredness). This makes me wonder whether it would be possible to shorten
the exposure time without affecting the present result.
3 November 1976
My findings after twelve days are still just as favourable. Fatigue is not a signi
ficant factor. Last Saturday I even worked four hours in one stretch and this ob
viously has much to do with the tremendous stimulus of being able to take up contact
again, to take my own initiatives. In short, a tonic. The ache in the neck I spoke
about some days ago has gone, without there having been any need to change the
position of the panel. It seems I am attached to this set-up.
2 December 1976
Experience with the typing is unchanged after six weeks. The number of mistakes I
make is small, one in every ten lines. There are hardly ever any wrong letters or
numbers; almost always the error is forgetting a space between two words and "new
line" after the break sign. It should be remembered that the typing speed is very
leisurely, much more so than normal typing. It is also possible that I am unconsciously influenced by the certainty of being able to make corrections in good time:
mistakes can be put right.
13 March 1977
In essence, what I reported earlier is still fully applicable after five months: I use
the apparatus very regularly, at least a couple of hours a day, and I find it rela
tively easy. An important point is, of course, that for me this is the only means of
expressing myself in writing (and since speech is becoming more difficult for me,
written communication is in any case receiving more emphasis). By this I mean that
an outsider, accustomed to communicating without difficulties in writing and by word
of mouth, probably has a mistaken idea about what is easy and not easy, since he
will always be inclined to compare the apparatus with a pen or a normal typewriter.
For me this comparison is meaningless and therefore I never make it; all that matters
to me is whether, given my circumstances, I can do with this machine the work I want
to do in the time available to me, and my conclusion is then that this is completely
the case.
6 September 1977
After more than ten months of use there is no reason for me to revise what I have
said earlier. This implies that working with the machine is still for me the ideal
solution, and that I am unable to indicate any way in which the apparatus might be
improved - apart from the points I mentioned earlier, but none of which affect the
principle. Compared with the earlier situation, my head is directed more downwards
during operation. The complaints have tended to increase since then, and my head is
much more often inclined to droop. This tendency seems to run just about parallel
to my general physical and above all mental condition at a particular moment. The
result of this is remarkable: sometimes I find it completely impossible to direct
the beam onto the letters, but when I am in a good condition I am still able, fortunately, to work with the apparatus for hours on end.
The purpose of this evaluation is therefore rather to draw attention to certain
developments in my disease, insofar as they relate to the use of the machine. In the
past period, for example, I have noticed a distinct weakening of the throat and neck
muscles, and speech - already rather a problem since May 1976 - has now become prac
tically impossible.
It was only after considerable hesitation that I decided to use the machine as an
instrument for speech as well as for written communication. Up to last June I was
still able to speak a little, although it was tiring and also painful. After that
my speech rapidly deteriorated and even my wife was often no longer able to understand
me. This brought a sense of growing isolation, with all its associated problems and
frustrations. Even so, it still seemed as if breaking through this by resorting to
the machine implied an admission that I had once and for all given up the use of the
spoken word. I mention this here because I have repeatedly noticed in myself this
desire to hold on to a vanishing function right up to the very last moment. Possibly
it is a common phenomenon in invalidity; in that case it ought to be taken into account
in getting invalids accustomed to the use of aids such as this typewriter (and speech
communication) machine.
However this may be, my resistance was finally overcome and for about two months I
have regularly been using the machine for conversations. The normal arrangement is
for my conversation partner to sit beside me and to read the display screen while I
write. I appreciate it if he completes a word or a sentence so that I need no longer
go on writing. I must not give the impression that conversing in this manner is without
problems: the writing pace is 55 to 60 strokes per minute, that is to say ten words a
minute, which is truly a snail's pace and requires a great deal of patience, particular-
ly from the conversation partner (for the invalid himself the patience factor is not such a problem, because he has already had long training in the exercise of patience).
Depending on the nature of the conversation, the "atmosphere" in which it took place
and the degree of mutual understanding, I have in this way been able to conduct many
conversations very satisfactorily. I have noticed that, although it is necessary to
be somewhat sparing in expression, it is not a good thing to resort to a kind of
telegram style, because in that way the conversation is not given a chance and gets
no further than the exchange of information. I have a real problem only with a con
versation partner who "does not listen", that is to say goes on speaking while I write.
I think there are various reasons for this. In the first place, there is an enormous
difference of speed between his share in the conversation and mine, which does not
help the homogeneity of the conversation. Furthermore, I am forced to divide my
attention because I have to write and listen at the same time. And finally, my re
marks when they finally appear on the screen have sometimes ceased to be relevant:
the partner is already so far ahead that he either misinterprets them or completely
fails to understand them.
In addition, the wish arose to be able to converse in other parts of the house or
outside it where the machine is not available. For this purpose I use a piece of
cardboard on which the panel is copied, and with my head lamp I form words in the
same way as with the machine, so that while my conversation partner follows the
panel he "hears" what I have to say. It is quite a useful alternative, but does not
compare with conversing on the machine.
Discussion
For the user in question the typewriter operated by the directing of a lamp fixed to
his brow has proved to be a unique and fundamental aid. Practically from the very
beginning he was able to use the apparatus very easily and to express himself with it
in writing at speeds of the order of 60 strokes a minute and with very few mistakes.
More especially during the initial period the facilities for making corrections proved
to be important, since they gave the user the chance to correct his mistakes himself before they could be noticed by others.
The apparatus provided some compensation for the lack of hand and arm control. When
at a later stage the patient gradually lost his power of speech, the typewriter
was also used for direct communication with a conversation partner. This kept the
user mentally active and motivated.
In what way can a lightspot-operated typewriter be brought within the reach of those
who need such an aid? On the basis of experience gained, preferably with several
prototypes and different users, it would be necessary to make modifications to the
design.
The important point is to make as much use as possible of commercially available
equipment. Components to be newly designed will have to be studied in the prototype
stage with a view to production costs, ruggedness, reliability and service. When the
prototype is ready for production it will have to be tested in practice by future users.
There will also be problems of finance, organisation and distribution to be solved,
and this will involve building up a large number of contacts (Bouma, Engel, Melotte,
1972; Melotte and Leopold, 1976).
In the present case some of these problems were recognised earlier, and some models
of an improved version of the prototype described in 1973 by Soede have already been
tried out in practice. Given the nature of the problems to be solved, it is evident
that a fairly long development process will be needed. However, since the prices of
the commercial equipment used here are steadily falling, and in view of the possibility of a coordinated effort, it should be possible within a few years to make a
head-movement-operated typewriter generally available for handicapped persons in
need of such an aid.
As regards other capabilities of the equipment, a facility should be considered
for making it possible to add or erase something at any point in the display. This
would enable the user to produce a difficult text, requiring frequent recasting before reaching its final form. Various calculations, such as addition and multiplication operations, might also be displayed and carried out on the monitor. Another
possibility would be to make provision for games such as chess, crossword puzzles
and even ludo to be played, after some practice, on the monitor.
Finally, there are questions concerning associated equipment or associated possibilities. In the first place, a photocell panel can also be used as an operating panel
for other aids. For example, patients unable to spell (children, for instance) would
be able to make wishes known by means of pictures, and acoustic signals might also
be transmitted, for example via previously stored speech signals. In this connection
we had an earlier idea of a limited set of "vital sounds" (one per key) which could
represent an extreme stage of communication. At a somewhat less extreme level, one
might consider remote control of various kinds of equipment.
On the panel described here one photocell operates a horn. This can reduce the
user's feeling of isolation when he is working by himself. It is a device that can
easily be expanded to meet the user's personal wishes. Examples are the signal for
operating the page-turning device, for automatically dialling a telephone number,
or for switching a television set, radio set or cassette recorder on and off.
Such facilities would give the handicapped a greater degree of self-reliance.
References
Bouma, H., Engel, F.L. and Melotte, H.E.M. (1972) Technological devices for the visually handicapped: Gap between research effort and available aids, I.P.O. Annual Progress Report, 7, p. 46-54.
Soede, M., Stassen, H.G., van Lunteren, A. and Luitse, W.J. (1973) A lightspot operated typewriter for severely physically handicapped patients, Ergonomics, 16, p. 829-844.
Melotte, H.E.M. and Leopold, F.F. (1976) Development of aids for the perceptually handicapped, I.P.O. Annual Progress Report, 11, p. 116-119.
The IPO relief-drawing set
H.E.M. Melotte
Based on earlier experiences, an improved relief-drawing set has been developed by the Institute for Perception Research. This design, using an injection moulding technique, is better suited for large-scale production, which was in fact started in 1977.
By writing with some pressure with an ordinary ball-point, the IPO relief-drawing set makes it possible to produce durable, embossed drawings that are immediately tangible at the written side of the special plastic drawing sets.
The major applications of the relief-drawing set are seen as:
- an educational aid for the visually handicapped, emphasising abstract functioning,
e.g. to be used for the learning of writing, drawing, mathematics, geography,
games, music, etc.,
- a means for exchanging written messages between people with normal vision and
people with visual impairments and
- a scratch pad for the elderly blind who find it difficult to learn Braille.
The new relief-drawing set can be obtained from:
The Dutch Association for the Blind (VNBW),
Kipstraat 54, Rotterdam, The Netherlands.
Instrumentation
Digital equipment: a number of examples
L.F. Willems, G. Moonen, C. Lammers, J. Dobek, A. van Nes and H. Jimenez Nichols*
Introduction
The character of electronic instrument design is in the process of drastic change
due to the breakthrough of digital MSI and LSI circuits in particular. It looks as
if analog circuitry is losing more and more ground: even the T.V. receiver and the
audio tape recorder are in the course of being digitalized. An important factor in
this process seems to be the design philosophy of a preference for digital solutions
to electronic problems.
The advantages of digital circuitry are often quoted and they are probably all valid.
But the question then arises: does analog circuitry still have a future in the design of instruments? We have no answer to this problem; we confess, however, that in our laboratory we, too, are biased towards a digital design philosophy.
In this contribution we describe four apparatuses based on digital methods that we could not have imagined some years ago. The devices to be discussed are:
- A digital modulator for accurate and stable modulation of two analog signals, to be used in psychoacoustic experiments.
- The varidac, a signal gate with a number of possible slope waveforms: linear slopes, cosine-shaped slopes or gaussian slopes.
- The digital audio loop, a digital memory for storing speech waveforms (with a maximum duration of 1.6 sec.) that provides means for flexible retrieval of the waveform.
- A four-formant speech synthesizer that uses MSI digital filters.
Accurate four-quadrant modulator
In psychoacoustic experiments, equipment has to meet very stringent requirements, one
of the weak links being modulators. Analog multipliers yield nonlinear distortion
which limits the performance of the experimental set-up. These difficulties can be
overcome by digitalization of the modulator. With off-the-shelf components anydesired accuracy of the modulator can then be obtained.
A block diagram of the modulator is shown in Fig. 1. The design is straightforward:
the input signals are sampled by a sample-and-hold switch and converted by a 12-bit
ADC to digital numbers; these numbers are digitally multiplied in an MSI circuit (TRW MPY-12AJ, the circuit being used quite inefficiently here). After multiplication, the result is rounded to a 12-bit number, which is then converted to an analog voltage by means of a 12-bit DAC followed by a sample-and-hold circuit.
The timing of the various units in this modulator is taken from a PROM memory and
an address counter. The highest possible sampling frequency is 60 kHz. There are no
filters incorporated in the modulator.
* Student P.I.I.
Fig. 1. Block diagram of the four-quadrant modulator.
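To make the signal path concrete, the following listing is a minimal present-day software sketch of the same chain: sample, quantise to 12 bits, multiply, round back to 12 bits, convert. It models the principle only; the function names, the fixed-point scaling and the rounding rule are our own assumptions, not details taken from the hardware.

import numpy as np

def quantize_12bit(x):
    # Model the 12-bit ADC: map a signal in [-1, 1) to integers -2048..2047.
    return np.clip(np.round(x * 2048), -2048, 2047).astype(np.int64)

def digital_modulator(a, b):
    # a, b: input signals in [-1, 1), already sampled (at most 60 kHz here).
    a_q = quantize_12bit(a)
    b_q = quantize_12bit(b)
    product = a_q * b_q                      # 12 x 12 bits -> 24-bit product
    rounded = (product + (1 << 10)) >> 11    # round back to 12 bits
    rounded = np.clip(rounded, -2048, 2047)
    return rounded / 2048.0                  # the DAC: back to an analog value

# Example: four-quadrant modulation of a 1 kHz tone by a 5 Hz sine.
fs = 60_000
t = np.arange(fs) / fs
out = digital_modulator(np.sin(2 * np.pi * 1000 * t), np.sin(2 * np.pi * 5 * t))

Because the multiplication itself is exact, the only distortion left in such a scheme is the quantisation of input and output, which is what makes the digital solution attractive here.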
The varidac
In psychoacoustic and speech perception experiments signal gates with a high on-off
ratio (at least 60 - 70 dB) are required. The most critical point in such a gate is
the signal leakage in the gate-closed condition. Analog circuits performing the
gating function therefore need accurate balancing to remove the influence of unequal components in the circuit and the change in these components due to aging and temperature. For the envelope signal generator the same demands exist: in the
gate-closed situation the dc voltage controlling the gate must stay within a few
millivolts of zero volt.
In the design of the varidac these problems were circumvented by using a multiplying
DAC and a digital envelope signal generator. A multiplying DAC is obtained by using
a DAC whose reference voltage (of both polarities) can be applied externally. In the
varidac this reference voltage input is used as an input for the analog audio signal.
The voltage is then multiplied by the digital number fed to the DAC in the usual way.
The envelope signal is generated by accumulating stored increments or decrements
from a PROM memory (1024 x 4 bits) in an ALU (Arithmetic Logic Unit). The clock
frequency of this process determines the duration of the slope of the envelope signal. This duration can be chosen independently for the rise and fall of the envelope.
The following waveforms can be chosen for the envelope signal:
- rectangular
- linear slopes
- cosine-shaped slopes
- gaussian-shaped slopes
- gate continuously open (for testing purposes).
A block diagram of the varidac is shown in Fig. 2. The clock frequency for the
timing process of the PROM is provided by an address counter. With the switches t1 and t3 the slope durations can be selected (2.5 ms to 40 ms), while with a switch on the main clock divider all timing functions can be slowed down by a factor of 10.
The linear waveform is obtained by connecting the output of the address counter
directly to the multiplying DAC.
Fig. 2. Block diagram of the varidac.
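In software the same scheme reads as follows: a table of increments (the PROM) is accumulated (the ALU) to form the envelope, which then multiplies the audio signal (the multiplying DAC). The listing is an illustrative sketch; the table length matches the text, but the particular increment values, the gaussian constant and the function names are our own. The rectangular and continuously-open modes are trivial and omitted.

import numpy as np

def envelope_rise(shape, n=1024):
    # The running sum of the stored increments traces the chosen slope shape.
    x = np.linspace(0.0, 1.0, n)
    if shape == "linear":
        curve = x
    elif shape == "cosine":
        curve = 0.5 * (1.0 - np.cos(np.pi * x))
    elif shape == "gaussian":
        curve = np.exp(-4.5 * (x - 1.0) ** 2)   # assumed gaussian-shaped rise
    else:
        raise ValueError(shape)
    increments = np.diff(curve, prepend=0.0)    # contents of the increment PROM
    env = np.cumsum(increments)                 # the ALU accumulating them
    return env / env[-1]                        # normalise: gate fully opens

def gate(signal, fs, rise_ms=2.5, fall_ms=40.0, shape="cosine"):
    # Gate a signal; rise and fall durations are selectable independently
    # (2.5 ms to 40 ms in the varidac, set by the clock frequency).
    # The signal is assumed longer than rise + fall.
    n_rise = int(fs * rise_ms / 1000.0)
    n_fall = int(fs * fall_ms / 1000.0)
    table = envelope_rise(shape)
    grid = np.linspace(0.0, 1.0, len(table))
    up = np.interp(np.linspace(0, 1, n_rise), grid, table)
    down = np.interp(np.linspace(0, 1, n_fall), grid, table)[::-1]
    env = np.concatenate([up, np.ones(len(signal) - n_rise - n_fall), down])
    return signal * env   # multiplying DAC: digital envelope times audio signal

Since the closed-gate condition corresponds to the digital number zero, the on-off ratio is limited only by DAC leakage, not by component balance.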
Digital audio loop
A basic drawback of a tape loop mounted on a tape recorder for providing repeating
speech fragments is that it cannot be synchronised with other equipment. Storing
the speech waveform in a digital memory provides, among other possibilities, immediate access to any part of the recorded speech and, to a large extent, obviates the synchronising problems.
In the design discussed here the speech waveform is quantised by means of a 10-bit
ADC and stored in a 16k x 10 bit memory. The resulting signal-to-quantisation-noise ratio also compares favourably with the signal-to-noise ratio of tape recorders.
Access to the speech waveform is obtained by addressing the digital memory and
routeing the speech samples to a DAC. A speech fragment of about 1.6 sec. can be
stored with a sampling frequency of 10 kHz.
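As a rough cross-check (a textbook estimate of our own, not a figure from the original design): uniform quantisation with N bits gives a signal-to-quantisation-noise ratio of about 6.02 N + 1.76 dB, i.e. roughly 62 dB for N = 10, which is indeed better than the 50-60 dB typical of analog tape; and 16384 samples read out at 10 kHz last 16384/10000, or about 1.6 sec, the duration quoted above.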
A block diagram of the apparatus is shown in Fig. 3. The memory addressing is performed
with the aid of an address counter so that the successive memory locations can be reached. This address counter can be preset at a certain position by means of a number of switches. There are two sets of these switches and a remote entry. At an externally supplied starting pulse the address counter is loaded from the corresponding set of switches and started at that position.
Fig. 3. Block diagram of the digital audio loop.
The output of the address counter is compared with the setting of a third set of
switches: a pulse is generated when the address counter reaches the value set by
these switches. This pulse can be used to stop the cycle, to restart the same cycle
or start another segment, or even to start an external device.
The digital audio loop can also be used as a variable delay for speech signals. To this end the signals are stored in memory and read out again after the delay time has elapsed.
There are no filters incorporated in the device, so that the sampling frequency can be chosen freely.
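In software terms the device is a presettable circular buffer. The sketch below models the 16k x 10-bit memory, the presettable address counter and the stop comparator; the class and method names are our own, and the quantisation step is simplified.

import numpy as np

class AudioLoop:
    # Software model of the digital audio loop: a 16k x 10-bit memory
    # addressed by a presettable counter with a stop comparator.

    def __init__(self, size=16384):
        self.memory = np.zeros(size, dtype=np.int16)

    def record(self, samples):
        # Quantise to 10 bits (-512..511) and store, as the ADC does.
        n = min(len(samples), len(self.memory))
        self.memory[:n] = np.clip(np.round(samples[:n] * 512), -512, 511)

    def play(self, start, stop):
        # Load the counter from a switch setting ('start') and run until the
        # comparator pulse at 'stop'; in the hardware that pulse can equally
        # restart the cycle, start another segment, or trigger a device.
        return self.memory[start:stop] / 512.0

    def delay(self, samples, delay_samples):
        # Variable delay: write the signal in, read it out again later
        # (delay_samples is assumed smaller than the signal length).
        self.record(samples)
        early = self.memory[:len(samples) - delay_samples] / 512.0
        return np.concatenate([np.zeros(delay_samples), early])

At a 10 kHz sampling rate, for instance, play(0, 16000) would replay a 1.6 sec fragment, and any segment boundary can serve as a synchronisation point for other equipment.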
A four-formant speech synthesizer
A digital speech synthesizer, based on four formant filters connected in series,
was built with the aid of MSI circuits, on each of which two digital filters are
realized (TMC 539).
The synthesizer has to produce speech from formant-coded speech data, with a sample rate of 8 kHz. The coded speech data can conveniently be obtained by using LPC methods (cf. Vogten and Willems, this issue). In the synthesizer these formant data are used as coefficients for the digital filters.
A block diagram of the synthesizer is shown in Fig. 4. The two digital filters in
one circuit block require the serial storage of 120 bits of coefficient data, hence each filter pair has an associated shift register 120 bits long.
Fig. 4. Block diagram of the speech synthesizer.
For unvoiced speech a maximum-length-sequence generator is used as a noise source, the sequence length being 2^15 - 1 samples. For voiced speech a sawtooth generator is used, which is simply an accumulating register to which the pitch frequency (F0) is added, the register being reset as soon as the accumulated sum exceeds the sampling frequency.
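Both excitation sources and the filter cascade translate directly into software. The listing below is a simplified model: the maximum-length sequence uses an assumed 15-bit feedback tap pair, and the resonator coefficient formula is a common textbook choice, not the TMC 539 realisation.

import numpy as np
from scipy.signal import lfilter

FS = 8000  # sample rate of the synthesizer

def mls_noise(n):
    # 15-bit linear feedback shift register; period 2**15 - 1 samples.
    reg, out = 1, np.empty(n)
    for i in range(n):
        bit = ((reg >> 14) ^ (reg >> 13)) & 1    # assumed taps 15 and 14
        reg = ((reg << 1) | bit) & 0x7FFF
        out[i] = 1.0 if bit else -1.0
    return out

def sawtooth(n, f0):
    # Accumulating register to which F0 is added every sample; reset as
    # soon as the accumulated sum exceeds the sampling frequency.
    acc, out = 0.0, np.empty(n)
    for i in range(n):
        acc += f0
        if acc >= FS:
            acc -= FS
        out[i] = 2.0 * acc / FS - 1.0
    return out

def formant(x, freq, bw):
    # One second-order digital resonator of the series cascade F1..F4.
    r = np.exp(-np.pi * bw / FS)
    a = [1.0, -2.0 * r * np.cos(2.0 * np.pi * freq / FS), r * r]
    return lfilter([sum(a)], a, x)               # unity gain at dc

# A steady vowel-like fragment: voiced source through four formants in series.
y = sawtooth(FS, f0=100.0)
for f, b in [(500, 50), (1500, 70), (2500, 110), (3500, 180)]:
    y = formant(y, f, b)

Running speech would of course update F0 and the formant coefficients frame by frame from the LPC analysis, which is exactly what the coefficient memory and shift registers provide for in the hardware.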
A controlled voice switch
U.O. Schroder
Introduction
In our visual experiments stimuli are presented to a subject and we want to know
something about the processing time required by the subject. From the experimenter's
point of view the most convenient way would be to measure the response latency by
means of a push-button. However, this allows anticipation of the stimulus and places an extra burden on the subject carrying out the task.
From the subject's point of view a voice switch is more convenient and there is no
anticipation, but a voice switch sometimes responds to extraneous sounds, like the closing of a door, breathing, or the opening of the subject's lips. In spite of these difficulties we chose the voice switch to measure our response latencies; in this contribution we describe how we check the correctness of the measured response latencies, and we report a number of test experiments on the performance of the voice switch.
Form of the voice switch
Version I
In a first version of a voice switch control we started a 1 kHz sound when the
stimulus was presented to the subject. At the moment the voice switch detected a voice onset, the 1 kHz sound stopped and a control lamp lit up. During the
session the 1 kHz sound was recorded on the left channel of the tape recorder,
the voice of the subject on the right channel. After the session, the correctness of the performance of the voice switch was checked by listening to the tape; incorrect response latencies were deleted from the data. It was noted that this control method didn't work too well, for, on listening to the tape once again, previously undetected errors of even 500 msec or more were found.
Version"
We changed our cheap dynamic microphone for an electret microphone and found that a
flat frequency response is very important for voice switch performance. The voice
switch control was also changed. Using a switch activated by the voice switch, the
microphone now switches to the other channel of the tape recorder on detection of a
voice response, and switches back on detection of the end of the response, that is
if there is no detected voice for 180 msec.
Therefore only the sounds that the voice switch reacts to are recorded on the right-hand channel, and when listening to the right-hand channel for incorrect response
latency measurements only the erroneous sound is audible.
As an example, let the stimulus be 'hat' and the response of the subject "hat or fat". On the left channel are audible: the pushing of the button, the
noise of changing the stimulus, and the first part of the onset of the letter "h"
followed after a short silence by "or fat". On the right channel there is audible
"hat",or "at" if the voice switch didn't respond to the "h" quickly enough. It is
therefore also possible to obtain information on the performance of the voice switch 137
on letters like "h". The first aim of our control was to facilitate improvement and
testing of the voice switch, so that now, using this voice switch plus control, we are able to exclude incorrect measurements (typically 75%) from the data in a very convenient way.
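The switching rule described above is simple enough to state as a program. The sketch below is our own reconstruction of the logic only: the frame length and the energy threshold are assumed values, and the actual detector is an analog circuit, not a computation.

import numpy as np

def route_channels(mic, fs, threshold=0.02, hangover_ms=180, frame_ms=10):
    # Route microphone samples to the right-hand channel while a voice
    # response is detected, and back to the left-hand channel once no
    # voice has been detected for 180 msec.
    frame = int(fs * frame_ms / 1000)
    hang = hangover_ms // frame_ms
    left, right = np.zeros_like(mic), np.zeros_like(mic)
    silent = hang                       # start in the 'no voice' state
    for i in range(0, len(mic) - frame + 1, frame):
        seg = mic[i:i + frame]
        if np.sqrt(np.mean(seg ** 2)) > threshold:
            silent = 0                  # voice detected: restart the hangover
        else:
            silent += 1
        if silent < hang:               # within 180 msec of detected voice
            right[i:i + frame] = seg
        else:
            left[i:i + frame] = seg
    return left, right

The 180 msec hangover is what keeps a whole response, pauses included, on one channel, so that only genuinely separate sounds end up as separate right-channel events.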
Testing the controlled voice switch
Three experiments have been done to get an idea of the errors introduced by using the voice switch.
During experiment I, randomly flashed lights were used and the subject had to respond each time directly after the flash, with a known word. This was repeated 40 times, then another word was chosen. For reference purposes the subjects were also asked to tap with a pencil instead of responding with a word.
Experiment II was almost the same as experiment I, except that the subject named alternately one of two words. In experiment III, words were presented foveally in a tachistoscope for as long as the subject needed to respond to the word (words were chosen at random, one out of five).
In the first two experiments the Dutch words were: lak, hak, zak and pak; in the third experiment: hak, zak, tak, tik and pak.
The averaged response latencies are shown in Fig. 1. There seems to be little difference between the voice switch response on the initial letters l, h and z, but the response latencies on the letter p are, surprisingly, about 15 msec delayed. This delay was expected on the response latencies of the letter h, because there is an audible difference between the onset of the "h" and the moment the voice switch responds.
The conclusion is that measured voice onset times are not uniquely related to the availability of the response of the subject. A clear dependence on the first letter of the word exists, up to say 60 msec for certain subjects. If the subject wishes to pronounce the letter 'p', more time is needed than for the letter 'h', because in pronouncing the letter 'p' pressure must be built up in the lungs, which is not necessary for a letter like 'h'. Therefore an audible improvement of the voice switch on the letter 'h' would result in even greater differences in response latencies between 'p' and 'h' while, ideally, there should not be any difference at all.
Fig. 1. Differences in the measured response latencies found in three experiments. Experimental points represent the average of five subjects. (Reference: tapping with a pencil.)
Behaviour of the latency distributions
Except for letters like 'p', there is no sharp onset of the voice; a gradual onset causes fluctuations in the measured response latencies, while for the letter 'h' the 'onset time' is 100 msec. It might therefore be expected that the voice switch introduces an extra uncertainty in the response latencies, resulting in an increased spread of the distribution.
Fig. 2a. The standard deviations (s) of the measured response times in experiment I, for each of the five subjects, were quite the same for all subjects and words. (The mean response latency was 250 ms.)
Fig. 2b. The standard deviations (s) of the response latencies in experiment III were greater than in experiment I. (The mean response latency was 320 ms.)
In experiments I and II the standard deviation was about 35 msec for all words and subjects (Fig. 2a); the standard deviations for spoken words were 5 msec higher than for the mechanical task. In experiment III the need for recognition of words introduced a much greater increase in the standard deviation (Fig. 2b) than in experiments I and II.
Our conclusion is therefore that the voice switch errors are negligible in visual
recognition experiments where the standard deviations are 100 msec or more.
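This conclusion can be checked with the usual rule for combining independent sources of spread: taking the switch jitter to be the 35 msec found above and assuming it independent of a task spread of 100 msec, the measured standard deviation becomes √(100² + 35²) ≈ 106 msec, an inflation of only about 6%.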
IPO publications
IPO publications 1977
P 309 H. Duifhuis
Cochlear Nonlinearity and Second Filter: Psychophysical Estimation of Model Parameters.
J. Acoust. Soc. Amer., 1976, 60, p. S39.
Abstract of a paper contributed to the 92nd meeting of the Acoustical Society of America, held at San Diego, November 1976.

P 310 H. Duifhuis and W.F. Simons
The Critical Band: its Relation to Cochlear Nonlinearity and Second Filter.
J. Acoust. Soc. Amer., 1976, 60, p. S39-S40.
Abstract of a paper contributed to the 92nd meeting of the Acoustical Society of America, held at San Diego, November 1976.

P 311 A.F.V. van Katwijk
Implicit Knowledge of Pitch Patterns in the Perception of Accented Syllables.
Löwen und Sprachtiger. Ed. Rudolf Kern, Leuven: Peeters 1976, p. 385-394.
It is not simple acoustical attributes which constitute the cues by which native speakers of Dutch pick out accented syllables in spoken language. It is rather pitch movements that are integrated parts of intonation patterns that elicit accent judgments consistently and non-trivially.

P 312 D.G. Bouwhuis
Recensie van Psycholinguistics door N.H. Markel (Ed.).
Massa Communicatie, 1976, p. 235-237.

P 313 F.L. Engel
Visual Conspicuity, Visual Search and Fixation Tendencies of the Eye.
Vision Res., 1977, 17, p. 95-108.
The cumulative probability of target discovery during search has been related experimentally to the relevant "conspicuity area", the visual field in which the target can be discovered after a single eye fixation. During search, "non-targets" were found to be fixated spontaneously in proportion to their conspicuity area. Further, small spontaneous eye fluctuations are described that occurred, during determination of the conspicuity areas, in the direction of the target discovered. Their occurrence and delay depended on the target eccentricity and the size of the conspicuity area. The results emphasize the relevance of the conspicuity area to research on visual selection.

P 314 Ch.P. Legein and H. Bouma
Dyslectic and Normally-Reading Children. I. Exploration of a Letter-Search Test for Screening Purposes. II. Follow-up and further Exploration in 4 Weak and 4 Normal Readers on Letter, Word and Number Recognition.
Documenta Ophthalmologica, 1977, p. 391-396.
Exploratory experiments compare dyslectics to normally reading children. In a letter search test, the dyslectics scored lower than the controls, but both groups gave lower scores for letters in words than for letters in unpronounceable strings. As to reading level, both groups read better than one year earlier, but only the dyslectics increased in parafoveal word recognition. Short-term memory scores for visually presented digits were worse in dyslectics, whereas for auditorily presented digits there was less difference. It is advocated to study component processes of reading not just in isolation but rather in their mutual dynamic dependencies.
P 315 L.P.A.S. van Noorden
Minimum Differences of Level and Frequency for Perceptual Fission of Tone Sequences ABAB.
J. Acoust. Soc. Amer., 1977, 61, p. 1041-1045.
Stream segregation or fission of the fast alternating tone sequence ABAB is known to occur if there is a sufficient frequency difference between the tones A and B. In this paper it will be shown that level difference instead of frequency difference can be sufficient to enable the occurrence of fission. The smallest level difference between A and B is ΔL ≈ 3 dB (2.5-10 tones per sec; tone duration 40 msec). At rates faster than 12 tones per sec a new perceptive phenomenon was observed: the roll effect. It is characterized by the weak tones being heard at double the tempo. The relation with the continuity effect is investigated using alternating sequences with both level and frequency difference between the tones as stimuli.

P 316 J. 't Hart
Vers une Base Psychophonétique de la Stylisation Intonative.
Actes des 8èmes Journées d'Étude sur la Parole. Aix-en-Provence: GALF 1977, Vol. 1, p. 167-173.
In numerous experiments it appeared possible to stylise fundamental-frequency curves surprisingly strongly without losing "perceptual equivalence" with the original intonations. Could the limits of the resolving power for rapidly changing frequencies explain this effect? Considerations of the difference thresholds for the size of an F0 movement, for its position in the syllable and for its slope, in connection with the glissando threshold, show that the differences between natural F0 curves and their stylisations, as commonly applied, remain largely below threshold.

P 317 R. Collier
La Perception de l'Intonation Anglaise par des Anglophones et Néerlandophones.
Actes des 8èmes Journées d'Étude sur la Parole. Aix-en-Provence: GALF 1977, Vol. 1, p. 139-146.
In the perception of intonation, i.e. of the variations of the fundamental of spoken language, it is possible to abstract from certain physical differences so as to arrive at a global classification of melodic contours into a limited number of basic intonations. This faculty of intonational abstraction was studied in English-speaking and Dutch-speaking subjects, who had to sort twenty English sentences according to their own criterion of melodic resemblance.

P 318 D.J.H. Admiraal, B.L. Cardozo, G. Domburg and J.J.M. Neelen
Annoyance Due to Modulation Noise and Drop-Outs in Magnetic Sound Recording.
Philips Technical Review, 1977, 37, p. 29-37.
It is possible to carry out many kinds of physical measurements with great accuracy on a product intended for human use and still not obtain a conclusive answer to the question of the product's usability. This is because human perception also enters into the picture. If the investigation is extended to include a representative number of human subjects it will be discovered, however, that human perception obeys certain laws. These can often be quantified, as has been done for example in the theory of the chromaticity diagram and in the international definitions of loudness. More particularly in the context of noise abatement, a further step has been taken and efforts have been made to express the concept of annoyance in numerical terms, leading to reproducible results. Something of the same sort is attempted in the article below, which deals with the annoyance caused to the listener by two imperfections of magnetic sound recording that are hard to avoid: modulation noise and the spontaneous occurrence of short interruptions or "drop-outs".

P 319 D.G. Bouwhuis
Recensie van: Leesbaarheid: onderscheiden, opnemen en verwerken door J.M. Dirken.
Massa Communicatie, 1977, p. 152.
P 320 H. Duifhuis
Cochlear Nonlinearity and Second Filter: a Psychophysical Evaluation.
Psychophysics and Physiology of Hearing. Eds. E.F. Evans and J.P. Wilson, London: Academic Press 1977, p. 153-163.
The class of models consisting of a linear first filter, a time-invariant nonlinearity, and a linear second filter appears to be reasonably successful in describing phenomena like sharpening, two-tone suppression and combination tone generation. We have analysed such a model (J. Acoust. Soc. Amer., 59, 408-423, 1976) and we were able to make, among other things, certain quantitative predictions about two-tone suppression. The response of the model to a probe + masker complex (as a function of masker level; probe level fixed; masker off CF and probe at CF) would contain separate quantitative information on first and second filter as well as on the nonlinearity. We measured such responses psychoacoustically using the pulsation threshold. The observed results show obvious similarity to the expected results. Amounts of suppression of over 30 dB have been measured. However, we also found certain systematic deviations between data and theory. These seem to indicate that at any rate the assumption that the first filter is linear is questionable. Thus, a confrontation of the predicted results with the present data at best gives a crude first-order approximation of the parameters to be estimated.

P 321 B. Leshowitz and R. Lindstrom
Measurement of Nonlinearities in Listeners with Sensorineural Hearing Loss.
Psychophysics and Physiology of Hearing. Eds. E.F. Evans and J.P. Wilson, London: Academic Press 1977, p. 283-293.
Nonlinearities characterising the auditory system with cochlear pathology are examined in several investigations of the psychoacoustical tuning curve. In regions of threshold elevation, the curve is extremely broad due to the disappearance of the finely-tuned segment and displays a "notch" in the vicinity of the probe corresponding to the greatly diminished effectiveness of the masker in this region. For observers with abrupt high-frequency loss, combination tones are not evident for placement of masker and probe in regions of normal hearing. A concomitant of this abnormal frequency response is a marked reduction of neural suppression revealed in forward-masking experiments with bandlimited noise. The data are consistent with a physiological vulnerability of the second filter.

P 322 H. Bouma
Visuele Maskering in het Leesproces.
Z.W.O. Jaarboek 1976, p. 112-119.

P 323 P.A. Vroon, H. Timmers and S. Tempelaars
On the Hemispheric Representation of Time.
Attention and Performance, vol. 6. Ed. S. Dornic, Hillsdale: Erlbaum 1977, p. 231-245.
It is well known that subjective duration is related to cognitive processes. Also, differences between the information processing and analysing systems of both cerebral hemispheres have been reported. This study attempts to determine whether or not the left brain is superior in the encoding of time. In Experiment I the subjects reconstructed the durations of simple reaction time tasks carried out predominantly by either the left or the right brain half in the visual and auditory modality. It appeared that the variances of the time estimates of the right brain considerably exceeded those of the left. Consequently, there is a relatively great time uncertainty.
P 324 H. Timmers and W.A. Wagenaar
Inverse Statistics and Misperception of Exponential Growth.
Perception and Psychophysics, 1977, 21, p. 558-562.
Exponential growth presented by numerical series or graphs is grossly underestimated by human subjects. This misperception was considerably lessened by presenting decreasing functions; this conclusion holds for both numeric and graphic stimuli. In the numerical conditions about 25% of the subjects performed according to the statistical norm. In contrast with previous results, considerable individual differences with respect to sensitivity for rate of growth were observed. This finding was interpreted in terms of task difficulty: extrapolation of ascending series is too difficult a task to be discriminative. Extrapolation of descending series is much easier, and may therefore better discriminate among subjects.

P 325 H.C. Bunt
The Formal Semantics of Mass Terms.
Papers from the third Scandinavian Conference of Linguistics, held at Hanasaari, October 1-3, 1976. Ed. F. Karlsson, Turku: Academy of Finland 1976, p. 81-94.
The paper discusses the basic conceptual and formal problems connected with the semantic interpretation of mass terms. A conceptual analysis is presented which departs from the classical analysis due to Quine. The conceptual analysis which is put forward instead is formalised in terms of a newly developed extension of set theory, called ensemble theory. An axiomatic formulation of ensemble theory and a survey of its most important theorems are given in an appendix. It is shown that by analysing mass terms in the way proposed here many of the notorious semantic problems connected with them can be solved. The best developed alternative approach, due to Parsons, is shown to be fundamentally inadequate and inconsistent.
Papers accepted for publication
MS 286 L.L.M. Vogten
Simultaneous Pure-Tone Masking: the Dependence of Masking Asymmetries on Intensity.
To appear in: J. Acoust. Soc. Amer.
Phase locking between probe and masker was used in a series of pure-tone masking experiments. The masker was a stationary sine wave of variable frequency; the probe a fixed-frequency tone burst. We have observed that for small frequency separations the masking behaves asymmetrically around the probe frequency. This asymmetry depends on intensity. For a 1 kHz probe at low stimulus levels there is a maximum masking effect at about 60 Hz above the probe frequency, whereas at high levels maximum masking is produced at a frequency definitely below the probe frequency. These results are discussed in relation to current neurophysiological and psychophysical data. For the high-level asymmetry possible interpretations are suggested in terms of two changes in the excitation pattern of the basilar membrane: (a) a shift of the top and/or (b) a slope asymmetry, both increasing with level. The low-level asymmetry will be treated in a second paper.
MS 294 L.L.M. Vogten
Low-level Pure-Tone Masking: A Comparison of "Tuning Curves" obtained with Simultaneous and Forward Masking.
To appear in: J. Acoust. Soc. Amer.
Simultaneous and forward pure-tone masking are compared, using a fixed-level probe of 20 msec and a 200 msec masker. For a 1 kHz probe of 30 dB SPL the required masker level, Lm, is measured as a function of the time interval Δt between masker offset and probe onset. When masker and probe have equal frequencies a monotonic relationship is found for phase π but not for phase 0. When the masker frequency, fm, is 50 or 100 Hz below the probe frequency, fp, a nonmonotony is found, with a minimum at Δt = 0, the transition between simultaneous and forward masking. When fm is 50 or 100 Hz above fp, however, the relationship of Lm to Δt is monotonic. In the case of simultaneous masking the iso-Lm curves, which give Lm as a function of fm, show a typical asymmetry around fm = fp, leading to the positive shift of the maximum masking frequency, MMF, previously reported for stationary pure-tone maskers. In the case of forward masking, however, this asymmetry ceases to exist. We conclude that simultaneity of probe and masker is a necessary condition for the occurrence of a low-level positive MMF shift.
The results are discussed in the light of psychoacoustical and neurophysiological data on two-tone suppression. A possible interpretation of the nonmonotony and of the positive MMF shift is suggested in terms of the physiological asymmetry in two-tone suppression.

MS 300 S.G. Nooteboom, J.P.L. Brokx and J.J. de Rooij
Contributions of Prosody to Speech Perception.
To appear in: Studies in Language Perception; Proceedings of the Symposium on Language Perception, held at Paris, July 18-25, 1976.
In this paper empirical data relating contributions of prosody to recognition in speech perception are discussed. A first type of data is concerned with the perception of prosodic continuity of the attended voice. A sequence of speech sounds may or may not be heard as a continuous stream of speech produced by a single voice. It is shown that both continuity in periodicity pitch and continuity in spectral composition contribute to perceived prosodic continuity. A second type of data has to do with a contribution of speech rate in the immediate environment of a test segment to the phonemic perception of this segment as a short or a long phoneme. The data can be explained by assuming the existence of backward perceptual normalisation of segment duration to the temporal structure of auditory information coming later in time. A third type of evidence is related to the contributions of prosody to the perception of specific linguistic information. Speech prosody potentially carries information on lexical, syntactic and semantic aspects of the message. A number of investigations, both in our own institute and others reported in the literature, show that listeners may actually use this information when they need to.
MS 310 H. Bunt
Ensembles and the Formal Semantic Properties of Mass Terms.
To appear in: Mass Terms: Some Philosophical Problems. F.J. Pelletier (Ed.), Dordrecht: Reidel 1978.
This paper presents an analysis of the formal, i.e. non-lexical, semantic properties of mass terms, which forms the theoretical basis for the handling of mass noun expressions in the PHLIQA 1 question answering system. The paper starts out with a discussion of what distinguishes mass nouns from count nouns, from a syntactic as well as from a semantic point of view. Existing proposals to define this distinction are critically examined. As the defining characteristic of mass nouns is proposed the property of homogeneous reference: a mass noun refers in such a way that no particular articulation of the referent into parts is presupposed, nor the existence of minimal parts. To the notion of homogeneous reference a precise meaning is given by formalizing it in terms of a mathematical formalism, developed especially for the study of mass terms, called ensemble theory. Ensemble theory deals with mathematical objects, called ensembles, that are characterized by their parts (sets are special cases of ensembles). The paper contains an informal introduction to ensemble theory and a listing of the axioms on which the theory is founded. Mass adjectives are defined as those adjectives that denote a homogeneous property, a notion which is also made precise in terms of ensemble theory. Based on the notion of an ensemble, a formal language for the semantic representation of expressions containing mass terms is defined. A simple theory of amounts, containing just the minimal ingredients needed for dealing with quantified mass noun expressions, is also incorporated in this language. Using the logical properties of ensembles and amounts, it is shown that the proposed semantic representations of expressions containing mass terms do account for the formal semantic properties of such expressions.
MS 312 S.G. Nooteboom
Perceptual Adjustment to Speech Rate: a Case of Backward Perceptual Normalization.
To appear in: Album Hendrik Mol. Amsterdam: Institute of Phonetics.
The effect of speech rate on the phonemic perception of Dutch /a/ and /a:/ as a function of vowel duration is studied. The test segment was the vowel of the Dutch word taak (task), which, when shortened, is perceived as tak (branch). The word was embedded in a sentence. The speech rates of the preceding part and of the following part of this sentence were varied independently. It was found that an increase in speech rate of the following part of the sentence consistently leads to a decrease of the phoneme boundary in milliseconds. A decrease of speech rate had no such effect. In no case did the speech rate of the preceding part of the sentence have any systematic effect on the phoneme boundary. The data are explained in terms of the relative importance of the most recent auditory information to perceptual normalization, and of an increase in the relative importance of spectral cues as compared to acoustic duration in slowed-down speech.

MS 314 H. Bouma and F.L. van Nes
De Leesbaarheid van Lijnsegment Cijfers op Displays.
To appear in: Nederlands Tijdschrift voor de Psychologie en haar Grensgebieden.
Technical and economical developments have led to a rapidly increasing use of new media for text presentation. Little is known about the legibility of the letter and number shapes which are used for such electronic media. This paper deals with research on the recognizability of numbers built up from straight line segments, which therefore have a schematic form. Erroneous recognition of such numbers leads to confusion between them. The distinctive function of the individual line segments has been determined from the errors. This analysis leads to an improved design of the number shapes. Firstly, the improvements aim at increasing the discriminability of the numbers; secondly, improvement of their acceptability, i.e. resemblance to the usual number shapes, plays a role.
MS 317 J. 't Hart
Looking for Rhythmical Structures Evoked by Isochronous Syllable Strings.
To appear in: Album Hendrik Mol. Amsterdam: Institute of Phonetics.
In 1975, we did an experiment on the relation between the location of the non-final fall and syntactic structure. Stimuli were strings of "hummed syllables" with pitch contours in which the location of the non-final fall was varied. Subjects responded by making sentences to which the given pitch contours would provide suitable fits. On intuitive grounds, it was later supposed that the response material contained systematic phenomena with respect to rhythmical organisation. In particular, the way in which rather long stretches of syllables between pitch accents seemed to be subdivided suggested a relation to the total number of syllables, the location of pitch accents and of the non-final fall. An attempt has been made to find formal criteria according to which subdivisions can be assigned on the basis of the language material alone. Fair agreement with the intuitive approach can be obtained when using pitch accents, lexical stresses and the alternation of full and neutral vowels to score on a 'light-heavy' scale. The subdivisions, now based on these scores, show essentially the same relations as mentioned above. It is concluded that intonation and the total number of syllables of the stimulus material have elicited particular rhythmical organisations of the response sentences and that these organisations are reflected in the language material chosen by the subjects. Meanwhile, their systematic trends seem to imply that the formal criteria applied to recover the rhythmic organisation correspond to some psychological reality.
MS 320 S.G. Nooteboom
Speaking and Unspeaking: Detection and Correction of Phonological and Lexical Errors in Spontaneous Speech.
Paper submitted to the Working Group on "Slips of the Tongue and Ear"; 12th International Congress of Linguists, August 31 - September 2, 1977.
An analysis of corrections of phonological and lexical speech errors in Meringer's corpus shows that: (1) most speech errors are corrected, phonological errors slightly more often than lexical ones; (2) stops for a new start are predominantly made at the first word boundary after the error, later stops being more frequent for lexical than for phonological errors; (3) in phonological errors new starts practically always go back to the last word boundary preceding the error, in lexical errors often further. To explain these data a mental strategy is hypothesized which checks the
output speech for phonological orthodoxy of word forms, and the syntactic and semantic appropriateness of short phrases.

MS 321 Ch.P. Legein and H. Bouma
Leesprocessen bij Leeszwakke Kinderen.
To appear in: Proeven op de som, Psychonomie in het Dagelijks Leven. Janssen, Vroon and Wagenaar (Eds.), Deventer: Van Loghum Slaterus.
When reading is investigated in children with reading difficulties, it appears not only that they react strikingly more slowly than normally reading children, but also that they have a smaller visual reading field. These findings might be of use in developing reading training programmes.
MS 322 H.E.M. Melotte and F.L. Engel
De IPO Relief-Tekenmap; van Idee tot Hulpmiddel.
To appear in: Maandblad voor Revalidatie.
A short description of the so-called relief-drawing set: a communication aid developed at IPO with which blind and partially sighted people can draw and write in permanently tangible and visible relief.
Reprints and preprints of IPO publications
Requests for reprints or preprints of the publications listed on pages 141-147 should be addressed to:
Library
Institute for Perception Research
P.O. Box 513, 5612 AZ Eindhoven
The Netherlands