IPO ANNUAL PROGRESS REPORT
Nr.12 1977
Editor: A.J. Breimer
Typist: Jeanneke van Esch
INSTITUTE FOR PERCEPTION RESEARCH - INSTITUUT VOOR PERCEPTIE ONDERZOEK
P.O. BOX 513 EINDHOVEN HOLLAND
TELEPHONE NATIONAL (040) 756605 / TELEPHONE INTERNATIONAL +3140 756605
Contents                                                                  page

Contents                                                                     2
Introduction                                                                 5
Research programme                                                           6
Organisation IPO                                                             8

Auditory perception
B. Leshowitz
    Speech intelligibility in noise for listeners with sensorineural
    hearing damage.                                                         11
H. Duifhuis, J. Smits, J. v.d. Vorst and M. Scheffers
    Further psychophysical data on two-tone suppression.                    24
J. Thomassen
    Preliminary experiments on accent perception in tone sequences.         29
B.L. Cardozo and K.G. van der Veen
    Estimation of annoyance due to low-level sound.                         34

Speech
H.F. Muller, S.G. Nooteboom and L.F. Willems
    An experimental system for man-machine communication by means of
    speech.                                                                 41
L.L.M. Vogten and L.F. Willems
    The Formator: a speech analysis-synthesis system based on formant
    extraction from linear prediction coefficients.                         47
J. 't Hart
    Pitch contour stylisation on a high-quality analysis-resynthesis
    system.                                                                 55
A.F.V. van Katwijk
    Auditory feedback as a factor in disrupted speech production.           58
D. Bouwhuis and J. de Rooij
    Vowel length and the perception of prosodic boundaries.                 63
R. Collier
    The perception of English intonation by Dutch and English listeners.    69
S.M. Marcus
    The IPO speech squeezing system.                                        74

Visual perception
F.J.J. Blommaert
    Spatial processing of small visual stimuli.                             81
H. Bouma, Ch.P. Legein and A.L.M. van Rens
    Visual recognition by dyslectic children: response latencies for
    letters and words.                                                      87
U.O. Schröder
    Backward masking in a reading-like situation.                           92
H. Timmers
    Letter cancellation in words and nonwords.                              96
H. Bouma, D.G. Bouwhuis and H. Timmers
    Processing of visible language: a Symposium.                           100

Cognition
H.C. Bunt
    Towards an analysis of dialogue organization principles.               105
Knowledge of Dutch three-letter words. (Rectification)                     115

Ergonomics and perceptual aids
H. Bouma and F.L. van Nes
    Legibility of rectilinear digits.                                      117
P.H. van der Heijden, H. Bouma, H.E.M. Melotte and F. Meyer
    A typewriter for a motorically handicapped person, operated by
    head movements.                                                        124
H.E.M. Melotte
    The IPO relief-drawing set.                                            130

Instrumentation
L.F. Willems, G. Moonen, C. Lammers, J. Dobek, A. van Nes and
H. Jimenez Nichols
    Digital equipment: a number of examples.                               132
U.O. Schröder
    A controlled voice switch.                                             137

Publications                                                               141
This report or any part thereof may not be reproduced in any form without the written
permission of the Institute for Perception Research. Illustrations may be reproduced
only with explicit mentioning of source; copies will be appreciated.
Introduction
The I.P.O. Annual Progress Report for 1977 greets its friends and colleagues and
presents them with a summary of the Institute's research in human perception and
information-handling in a technological era.
In November 1977, Dr. J.H. Bannier retired from the Supervisory Board. From its
foundation in 1957, the Institute has benefited from his wisdom as a delegate of
Z.W.O., the Netherlands Organisation for the Advancement of Pure Research. We express
our gratitude for his contributions to our work, which have helped to establish and
maintain our links with related research groups. Dr. Bannier has been succeeded by
Mr. J. Smits, member of the Z.W.O. staff.
Thanks are due to all I.P.O. members who left us in 1977. Among them, Mr. J.Chr.
Valbracht retired; over the years he has provided us with many pieces of
electronic equipment.
The serious lack of accommodation was somewhat eased when a new wing in a neighbouring
building was put into use in August 1977. The new corridor between the two buildings
has changed the outward appearance of the whole.
Trends in our fields of interest are reflected both in the contributions to the
present issue and in the research programme, listed on pages 6-7.
Dr. Barry H. Leshowitz, during his sabbatical, helped us step up our research on
speech communication by the hard-of-hearing in noisy environments. Work has also
been intensified on the interaction between man and machine when using speech in
both directions and on perceptual properties of visual displays.
In the last-named field, the Institute organised a symposium on "Processing of
Visible Language", from 5-8 September 1977. Research psychologists, graphic designers
and display engineers assembled to exchange views on the links between increased
understanding on the part of fundamental research and the ways of reading imposed
by present technology. A brief reflection on the symposium will be found on page 100.
As before, we shall be happy to maintain and extend contacts and cooperation with
our colleagues concerned with auditory and visual communication.
H. Bouma
Research Programme 1977/1978
IPO research is generally directed towards understanding intake and processing of
information by humans, in particular when they make use of technological means.
Optimal design and usage of such technology are increasingly dependent on such insights.
Hearing
cochlear transduction Nonlinear processes in the cochlea are being studied
quantitatively by psycho-acoustic experiments such as masking for both
normal and impaired hearing. In the latter case, certain deviating types of
masking have been found which have implications for hearing aids.
sound control Perceptual evaluation of pleasant sounds, such as music, gives
rise to certain transmission and recording requirements. For unpleasant
sounds ('noise') it leads to requirements for loudness and other sound
qualities.
musical accents Properties of tone sequences leading to musical accents
are being studied.
Speech
connected speech Consequences of the available description of Dutch intonation
patterns for pronunciation teaching are being considered. An attempt has also
been made to develop a similar description of the intonation of British English.
In a new approach the communicative value of pitch accents will be considered.
The relevance of time and frequency properties of speech to speech communication
is being studied.
word recognition A description of the acoustic attributes used in the
recognition of spoken Dutch words will be attempted.
speech processing Various facilities for analysis, synthesis and editing
of speech are available. The rapidly expanding technology for hardware and
software processing has made us step up our effort in this field, also as
regards perceptual consequences.
Vision
luminance contrasts We aim at a quantitative understanding of interactions
of stimuli close together in time and space. In the time domain, pulse,
step and frequency responses are considered; in the space domain, point-spread
and edge-spread functions. Combinations of temporal and spatial
changes in the visual field are also being given attention.
image quality Guided by basic transfer functions of the visual
system, physical parameters of electronic displays are being considered
in order to understand and improve display quality.
visual selection We are concentrating on visual conspicuity, defined as
properties of stimulus and surround and the strategy of visual search
connected with it.
Reading
Reading processes are being investigated both in optimally presented text
and electronic text displays where factors such as size and contrast are
often not optimal. For dyslectic children, we study the coordination of
information from the two eyes.
Cognition and Communication
word recognition In recognition processes, perceptual analysis combines
with available knowledge. We are studying availability of words outside
context for incorporation in a quantitative theory of the visual recognition
of short words.
informational dialogues To obtain well-defined information from a rich
source, short series of questions and answers may be used. The structure
of such dialogues is being considered in terms of semantic notions. We
also study practical dialogues for communication between users and
information automata.
Ergonomics
Research is directed at anticipating ergonomic consequences of the appli
cation of new technologies such as in man-computer interaction. The develop
ment of certain new types of industrial product is being supported.
Aids for the handicapped
Certain new communication aids for people with visual, auditory or motor
handicaps are being initiated, developed and tested. Using existing production
and distribution channels, we work towards the goal of bringing
aids of proved usefulness within the reach of all who need them.
Organisation IPO
Supervisory board (31.12.1977)
    Prof. Dr. C.E. Mulders (chairman)
    Prof. Dr. W.A.T. Meuwese
    Prof. Dr. J.F. Schouten
    Drs. J. Smits
    Dr. Ir. K. Teer

Scientific board (31.12.1977)
    Prof. Dr. H.B.G. Casimir (chairman)  - Heeze
    Prof. Ir. R.G. Boiten                - Delft
    Prof. Dr. Ir. P. Eijkhoff            - Eindhoven
    Prof. Dr. J.P. van de Geer           - Leiden
    Prof. Dr. H.E. Henkes                - Rotterdam
    Prof. Dr. L.F.W. de Klerk            - Tilburg
    Prof. Dr. S.L. Kwee                  - Eindhoven
    Prof. Dr. W.J.M. Levelt              - Nijmegen
    Prof. Dr. Ir. R. Plomp               - Soesterberg
    Prof. Ir. O. Rademaker               - Eindhoven
    Prof. Dr. R.J. Ritsma                - Groningen
    Prof. Dr. H. Schultink               - Utrecht
    Prof. Dr. Ir. H. Spekreijse          - Amsterdam
    Prof. Dr. P.C. Veenstra              - Eindhoven
    Prof. Dr. C.J.D.M. Verhagen          - Delft
    Dr. Ir. P.L. Walraven                - Soesterberg
    Dr. P.A. van Wely                    - Eindhoven
    Prof. Dr. P.J. Willems               - Tilburg
    Prof. Dr. Ir. A. van Wijngaarden     - Amsterdam

Director            Dr. H. Bouma
Deputy director     Drs. B.L. Cardozo
Adviser             Prof. Dr. A. Cohen   - Utrecht

Group leaders
    Dr. Ir. H. Duifhuis and Dr. S.G. Nooteboom  - Hearing and Speech
    Dr. Ir. J.A.J. Roufs                        - Vision and Reading
    Drs. D.G. Bouwhuis                          - Cognition and Communication
    Ing. F.F. Leopold                           - Ergonomics
    H.E.M. Melotte                              - Communication Aids for the Handicapped
    Ir. L.F. Willems                            - Instrumentation
Research associates
    Ing. H.J. Bleileven
    Ir. F.J.J. Blommaert
    P.M. Boers
    Ir. A.J. Breimer
    Ir. J.P.L. Brokx (Z.W.O.*)
    Drs. H.C. Bunt
    Dr. R. Collier (part-time)
    J. 't Hart
    Ing. Th.A. de Jong
    Dr. A.F.V. van Katwijk
    Dr. Ch.P. Legein (part-time)
    Prof. B.H. Leshowitz+ (for 1 year from Arizona State University)
    Dr. S.M. Marcus
    Ing. G.J.J. Moonen
    H.F. Muller
    Dr. Ir. F.L. van Nes
    Drs. J.R. de Pijper (Z.W.O.*)
    Drs. J.J. de Rooij+ (Z.W.O.*)
    Ir. U.O. Schröder (Z.W.O.*)
    Drs. J.M.E.W. Thomassen (Z.W.O.*)
    Drs. H. Timmers
    Ing. J.C. Valbracht+
    Ir. L.L.M. Vogten

Research staff
    Ing. E. de Braal
    Ing. J.J.G.M. Dobek
    G.J.N. Doodeman
    Ing. J.C. Jacobs
    C.A. Lammers
    A.W.J.J. Melchers
    A.C. van Nes
    Ing. J.A. Pellegrino van Stuyvenberg
    Ing. J. Polstra
    A.L.M. van Rens+
    K.G. van der Veen
    Ing. P. Ytsma
    H.W. Zelle

Secretaries
    Ms. M.A. Boerrigter
    Mrs. J.A.C.E. van Esch-van der Vleuten
    Mrs. C.J. Mennen-Senkeldam
    Mrs. C.E.A.L. Nuys-van de Water

Library
    Ms. R.M. Smith

Workshop
    C.G. Basten
    J.H. Bolkestein
    P.A.N. Broekmans
    A.L.M. de Cocq
    C.Th.P. Godschalx+
    D.J. van der Wees

+ Left during 1977
* Netherlands Organization for the Advancement of Pure Research
Auditory perception
Speech intelligibility in noise for listeners with sensorineural hearing damage
B. Leshowitz
Introduction
The great majority of hearing-impaired listeners suffer from physiological deficits
present at the sensory cell level in the cochlea. These disorders are often termed
"sensorineural" and cannot be medically treated either by surgery or drug therapy.
The only assistance available to listeners with sensorineural hearing damage is
amplification of the sound afforded by the personal hearing aid. Unfortunately,
efforts to ameliorate hearing disorders with amplification have not been entirely
successful. Indeed, almost every listener with sensorineural hearing loss reports
that, under many everyday conditions of background noise, speech reception is not
improved by the hearing aid.
The foregoing does not contradict the audiological observation that the hearing aid
often performs a useful function. Under the restricted condition of near total quiet,
amplification of the sound, consisting mainly of the desired speech signal, above
the threshold of audibility does allow good speech reception. How much of the listener's
acoustical day consists of substantial quiet is, of course, the fundamental issue.
Examination of the noisy communication situation experienced by the hearing-impaired
listener reveals a straightforward explanation of the widespread dissatisfaction
with the personal hearing aid. It is well-known that a concomitant of sensorineural
hearing loss is an increase in the ratio of speech level to noise level required
for just intelligible speech over that measured for the normal-hearing listener.
Measured in terms of speech-to-noise ratio, the hearing deficit may be as large as
10 to 15 dB (Plomp, 1978). When it is realized that in many common moderately noisy
communication situations, the normal listener must process speech at or near the
threshold of intelligibility, one can begin to understand the magnitude of the com
munication problem faced by the listener with impaired hearing. The personal hearing
aid cannot possibly improve the quality of partially masked speech since it provides
indiscriminate amplification of both speech and noise. Thus, while the listener with
a hearing deficit may report the presence of a speech signal, unless the speech-to
noise ratio exceeds some critical value, the listener will perceive the speech as
muffled and unintelligible. Effectively he will be deaf in all but the most ideal
listening conditions.
An auditory disability is generally assumed to be present when the listener cannot
engage in tete-a-tete conversation in quiet. Without minimizing this hearing activity,
a cursory analysis of the typical listener's acoustical day reveals that the majority
of speech communication takes place in ambient noise. Assessment of hearing capacities
from pure-tone threshold and speech intelligibility measurements made in quiet can
therefore hardly be expected to capture the everyday acoustical experiences of the
hearing-impaired listener.
Implicit in our emphasis on measurement of speech reception under realistic conditions
of moderate background noise is the assumption that the standard evaluation of hearing
in terms of the audiogram provides an inaccurate description of the listener's audi
tory capacities. On the assumption that threshold elevation in the speech region
measured in the quiet is the best predictor of speech reception, it is common audio
logical practice to evaluate the hearing impairment by averaging the pure-tone thres
holds at 500, 1000 and 2000 Hz. While there is little doubt that this index of hearing
loss does indeed predict the speech reception threshold in quiet, there is much less
justification for its continued application under the less ideal conditions of mo
derate background noise. That listeners with normal low-frequency hearing and se
lective high-frequency hearing loss often experience great difficulties in perceiving
speech in noise has frequently been reported in the anecdotal observations of practi
cing audiologists (Courtois, 1975). Specifically, listeners with an abrupt high
frequency loss due to noise trauma as well as the presbyacusic patient characterized
by a more gradually sloping audiogram are the two major categories of the hearing
impaired population thought to be especially vulnerable to noise.
The major aim of the present investigation was to determine whether there is a well
defined class of listeners incapacitated by their inability to understand speech
in noise to an extent far in excess of what would have been predicted from inspection
of the audiogram.
Positive findings would, it was felt, provide strong evidence that hearing handicap
is far more prevalent in the general population than has heretofore been recognized.
Having established the existence of an appreciable population of noise-sensitive
listeners, efforts could then be directed to developing a psychoacoustical framework
for understanding the speech communication handicap.
In the audiological literature the claim is often made that there is an increase in
the upward spread of masking for listeners with sensorineural hearing deficits. Thus,
we speculated that increase in the speech-to-noise ratio observed for listeners with
selective high-frequency hearing loss ought to be accompanied by a marked increase
in masking above that measured for normal-hearing listeners. Moreover, in view of
the speech communication handicap reported for these listeners, the enhanced masking
effect ought to take place in the low-frequency region of speech where pure-tone
thresholds are normal. A positive relationship between the pure-tone masking pattern
and speech reception in noise would, we reasoned, not only have obvious practical
implications for audiological procedures for assessing the effective hearing handi
cap, but would also constitute an intriguing research problem for the physiological
acoustician interested in basic hearing mechanisms. Although the physiological
correlates of behavioural thresholds are not well established, it is not unreasonable
to infer minimal sensory-cell loss in regions of normal hearing. Accepting this
supposition, then what underlying processes are responsible for the abnormal supra
threshold effects observed in seemingly normal regions of hearing?
Experiments
The primary strategy of the research was to assess intelligibility of speech presented
against a noise background for listeners with selective high-frequency hearing loss
and near-normal hearing in the speech frequencies. A second goal was to relate speech
reception in noise to the pattern of pure tone masking for individual listeners.
Subjects
Listeners with predominantly sensorineural hearing damage served as the experimental
group. When the air-bone gap exceeded 20 dB, the conductive loss was assumed to be
sufficiently great to eliminate the subject from participation in the experiment.
The pattern of hearing revealed by the audiogram in conjunction with the listener's
case history provided the basis for classifying the experimental subjects according
to etiology of the hearing loss. Listeners categorized as having "noise trauma (T)",
for example, showed a precipitous loss in the high frequencies and reported that they
had been exposed to appreciable levels of environmental noise. Presbyacusic subjects
(P) had a more gradually sloping audiogram and had no history of noise exposure. Two
female subjects had received extensive medical treatment with a mycine drug, and
were classified accordingly (D).
The control group (N) consisted of listeners in the under-40 age group with normal
audiograms. A second group (over-40) produced largely similar data, often with a
slight trend towards typical presbyacusic results; those results are not presented in this
paper.
Typical audiograms are shown in the results section.
Procedure
Speech intelligibility was measured for each listener using connected discourse as
the signal. The task of the subject was to adjust the level of the interfering back
ground (foyer noise) until the speech signal was "just intelligible". The listener
was informed that it was not necessary to recognize every word of the text, but only
to maintain the intelligibility of the story-line. A Bekesy up-and-down psychophysical
method was used for this purpose, wherein the subject continuously varied the level
of the background noise using an attenuator having a range of 40 dB and steps of 2 dB.
The level of the speech constituted the experimental variable and was varied between
the listener's speech reception threshold in quiet and 90 dB(A). (For a more complete
explanation of the application of the adjustment procedure to measurement of speech
intelligibility, the reader should consult the recent paper of Plomp (1976).)
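As an illustration of this adjustment logic, a minimal simulation might run as follows (a sketch only, not the experimental software; the starting noise level, the number of reversals tracked and the listener model is_intelligible are illustrative assumptions):

    def adjust_noise_level(speech_level_dB, is_intelligible,
                           start_noise_dB=40.0, step_dB=2.0, n_reversals=8):
        """Bekesy-style up-and-down adjustment of the background-noise level.

        The noise is raised in 2-dB steps while the speech remains intelligible
        and lowered when it no longer is; the threshold is taken as the mean
        noise level at the last few reversals.  is_intelligible(speech_dB,
        noise_dB) stands in for the listener's judgement."""
        noise = start_noise_dB
        prev_direction = 0
        reversals = []
        while len(reversals) < n_reversals:
            direction = +1 if is_intelligible(speech_level_dB, noise) else -1
            if prev_direction != 0 and direction != prev_direction:
                reversals.append(noise)
            prev_direction = direction
            noise += direction * step_dB
        noise_at_threshold = sum(reversals[-4:]) / 4.0
        # speech-to-noise ratio at the "just intelligible" point
        return speech_level_dB - noise_at_threshold

    # Hypothetical listener: speech intelligible whenever S/N exceeds -5 dB
    snr = adjust_noise_level(65.0, lambda s, n: s - n > -5.0)
    print(round(snr, 1))   # settles within a step or two of -5 dB
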
The extent of the upward spread of masking for each listener was assessed in a pure
tone masking experiment. In this paradigm, the masker was a 125 msec, ramps included,
gated sinusoid, presented with a linear 25-msec rise-fall time. The masker was pre
sented continuously throughout the listening session with a duty cycle of 0.5. The
signal was a 20-msec tone burst which was shaped with a 10-msec rise-fall time. The
signal was presented in the temporal centre of the gated masker during three successive
masker bursts and deleted every fourth burst. A listening session was devoted to
the measurement of a masked audiogram. The frequency and intensity of the masker
was held constant. The experimental variable was the frequency separation between
signal and masker.
In all experiments, the tonal masker was a 1000 Hz sinusoid, presented at either 75
or 105 dB SPL. Masked thresholds were measured using an adjustment procedure similar
to that described earlier in connection with measurement of speech intelligibility.
The subject was instructed to adjust the level of the probe using a 2-dB-step
attenuator until the probe was at the "threshold" of audibility. Since the intent
of the experiment was to examine the spread of masking into the higher frequencies,
precautions had to be taken to prevent the listener from basing his threshold
measurements on perception of low-frequency combination tones caused by simultaneous
presentation of signal and masker. Thus, at frequency ratios of signal (f_s) to
masker (f_m) between 1.2 and 1.5, an additional low-frequency masking tone at either
2f_m - f_s or f_s - f_m was added. The level of these added tones was at least 20 dB
below the primary masker and therefore played a negligible role in determining the
course of masking.
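The timing of the masking paradigm and the rule for the additional low-frequency tone can be summarised in a short sketch (a reconstruction for illustration only; the function and variable names are ours, and the extra tones are set at exactly 20 dB below the masker although the text only requires "at least 20 dB"):

    def masking_trial_plan(f_masker=1000.0, L_masker=75.0, f_signal=1500.0):
        """Plan one cycle of the pure-tone masking paradigm.

        The 1000-Hz masker is gated on for 125 ms (25-ms linear ramps) with a
        duty cycle of 0.5; a 20-ms probe (10-ms ramps) sits in the temporal
        centre of three successive masker bursts and is omitted in every
        fourth burst.  For signal/masker frequency ratios between 1.2 and 1.5
        an extra low-frequency tone masks the combination tones."""
        masker_on, period = 0.125, 0.250            # seconds, duty cycle 0.5
        bursts = []
        for k in range(4):
            t0 = k * period
            probe = None
            if k < 3:                               # probe deleted every 4th burst
                probe = t0 + masker_on / 2 - 0.010  # 20-ms probe centred in burst
            bursts.append({"masker_onset_s": t0, "probe_onset_s": probe})

        guard_tones = []
        ratio = f_signal / f_masker
        if 1.2 <= ratio <= 1.5:
            for f in (2 * f_masker - f_signal, f_signal - f_masker):
                guard_tones.append({"freq_Hz": f, "level_dB": L_masker - 20.0})
        return bursts, guard_tones
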
Results
In order to facilitate comparison between psychoacoustic and speech-intelligibility
measures of performance, masked and unmasked audiograms are plotted along with the
speech intelligibility functions. For each group are shown a typical subject (data
points) together with total and interquartile ranges across subjects (shaded).
In Fig. 1 are presented the data for listeners having normal hearing (N), as evi
denced by normal pure-tone thresholds throughout the entire audible frequency region
shown in the upper-left panel. The quality of speech intelligibility at various
levels of background noise is given in terms of the speech-to-noise ratio or masked
speech-intelligibility threshold and is depicted in the upper-right panel of each
figure. In general, it can be seen that intelligibility thresholds of about - 5 dB
are obtained at low and moderate noise levels, with a slight increase in threshold
apparent at the higher levels of background.
The pattern of pure-tone masking is depicted in the masked audiograms which are
plotted in the lower panels of Fig. 1. The masked audiogram relates signal threshold
to frequency of the signal presented against a longer-duration 1000 Hz tonal back
ground. As expected, we observe the asymmetry of the masked audiogram which is con
sistent with the standard observation of an upward spread of masking into the higher
frequencies. Comparing the masking patterns obtained at the two masker levels of
75 and 105 dB, we note that the spread of masking becomes more prominent with an
increase in the level of the masker. The present masking results, while in good
qualitative agreement with the classical findings of Wegel and Lane (1924), are, we
feel, noteworthy insofar as they demonstrate that perfectly reasonable masking
results can be obtained using a method of adjustment with listeners having only
minimal experience in psychoacoustical experimentation. In view of the consistency
of the measurements of masking across listeners comprising the control group, it was
deemed appropriate to compute an averaged masked audiogram. The latter serves as a
standard measure of performance against which masking for the hearing-impaired
listeners is compared; it is indicated by the dashed line in the appropriate figures.
Fig. 1. Combined data for listeners with normal hearing. Panel A shows audiograms,
panel B gives speech intelligibility thresholds in noise (as S/N ratio) as a function
of speech level, and panels C and D show masked audiograms for 1 kHz maskers of 75 and
105 dB, respectively. Throughout, data points represent data of one typical subject,
dashed lines averages across subjects (n = 8) and hatched areas indicate interquartile
and total ranges.
In Figures 2, 3 and 4 are plotted the performance functions for listeners classified
as "noise trauma" (T), "presbyacusic" (P) and "ototoxic" (D), respectively. From in
spection of these data several trends emerge, which, we feel, characterize the
listening performance of listeners with selective high-frequency hearing loss due to
sensorineural hearing damage. Most important, we note that according to the Average
Hearing Loss index (as deducible from the audiogram) all of our impaired
listeners would be considered to have near-normal hearing. That the audiograms do
not accurately portray the hearing capacities of these listeners is immediately
obvious from examination of the suprathreshold performance functions for masking
ana speech intelligibility. First, the masked audiograms (solid curve) reveal con
siderably more upward spread of masking than that obtained from normal listeners
(dashed curve). Moreover, this enhanced spread of masking observed for impaired
listeners is manifested in frequency regions where pure-tone thresholds in quiet
are perfectly normal. While the reader can confirm this observation from inspection 15
Fig. 2. Combined data for listeners with noise trauma (n = 7). Lay-out of panels as in
Fig. 1.
of any impaired subject's masked audiogram, let us illustrate the point by referring
to subject JB's performance. In Fig. 2a we observe that in the low-frequency region,
thresholds in quiet are normal, whereas above 2000 Hz a precipitous fall-off in
hearing sensitivity occurs due to acoustic trauma. Concentrating our attention on
the region above the masker at 1000 Hz and below the area of threshold elevation,
it can be seen from the masked audiograms that masking is about 20 dB greater than
is recorded for normal listeners. From examination of the masking patterns of other
listeners with selective high-frequency loss, we observe elevations in masking as
large as 40 dB. Moreover, this enhanced masking effect often occurs in frequency
regions remote from the region of hearing loss where pure-tone thresholds are normal.
The speech intelligibility signal-to-noise thresholds are plotted in panels D. There
is anything from a 5 to 15 dB increase in the speech-to-noise ratio relative to
performance of normal listeners. Moreover, the increase in the S/N is observed
throughout the listener's available range of hearing, thus indicating that the aural
Fig. 3. Combined data for presbyacusic listeners (n = 14). Lay-out of panels as in Fig. 1.
overload attendant to high-level stimulation cannot account for loss of speech in
telligibility experienced by listeners with predominantly high-frequency hearing
loss. A rise in speech threshold seems to be related to a significant upward spread
of masking.
Discussion of results
In summarising the main experimental findings the following results should be
emphasized: (1) for normal hearing adults, the good agreement between the present
data and similar findings in the literature is evidence of the suitability of the
psychophysical procedure of adjustment to measurement of both speech intelligibility
and pure-tone masking for inexperienced listeners; (2) relative to the normal control
group, listeners with selective high-frequency hearing loss show as much as 40 dB
more masking at high frequencies; (3) the increased upward spread of masking ob-
served with impaired listeners is often found in regions of normal pure-tone thres-
Fig. 4. Combined data for two ototoxic listeners. Further lay-out as in Fig. 1.
hold; (4) masked speech-intelligibility thresholds of impaired listeners are in the
order of 5 to 15 dB worse than those of normal observers; (5) whereas measurement of hearing
in terms of the traditional Average Hearing Loss index frequently does not serve
to distinguish members of the experimental and control groups, suprathreshold mea
sures of speech intelligibility and pure-tone masking show essentially no overlap
between the two groups.
Although both the small number of subjects partaking in the present experiment
and the limited stimulus parameter space investigated limit the generality
of the present findings, the potential implications of the results are so far-
ranging that they deserve detailed consideration. Following a discussion of how the
present results bear on our notions of mechanisms underlying auditory masking, more
applied audiological problems of diagnosis and treatment of perceptual handicaps are
considered. In the concluding subsection are presented some final observations on how
the present laboratory findings may influence the reformulation of medico-legal
statutes governing determination of hearing handicap.
The upward spread of masking and sensorineural hearing loss
Investigations directed at measuring pure-tone masking for listeners with sensori
neural hearing loss have not produced a consistent picture of auditory masking in
the impaired ear. For example, De Boer and Bouwmeester (1974), who investigated
masking patterns produced by narrow bands of noise, have demonstrated increased a
mounts of masking above the passband of the masker in ears with cochlear pathology,
presumably attributable to pronounced harmonic distortion. Nelson and Bilger (1974),
on the other hand, have obtained equivalent masking for probes placed at the octave
above the pure-tone masker. Reports in the older clinical literature, unfortunately,
do not help us to understand the discrepancy between the two reports.
In an attempt to resolve the apparently contradictory findings Leshowitz and Lind
strom (1977) investigated the spread of masking in regions of normal and elevated
threshold within the same ear. The shape of the psychoacoustical tuning-curve,
relating the level of tonal masker just sufficient to mask the fixed frequency probe,
served as the measure of frequency resolution. They observed a complete loss of
frequency selectivity in regions of threshold elevation, whereas in the normal
regions of hearing, tuning curves appeared to be similar to those of normal
listeners. In a second experiment, however, they obtained evidence indicating that
in normal regions of hearing there were aspects of auditory processing that had
been altered by the presence of a lesion in a remote region of the cochlea. It is
well known that presentation of two primary tones gives rise to perception of
additional tones, called combination tones, not present in the stimulus. In listeners
with abrupt high-frequency hearing loss, Leshowitz and Lindstrom observed a marked
reduction in the generation of odd-order combination tones although the primary
tones were both presented in regions of presumably normal hearing.
Additional evidence of abnormal auditory processing in seemingly normal regions of
hearing was also obtained in the present experiment. From inspection of the masked
audiogram it has been observed that when both the probe and masker were located in
regions of normal hearing, masking was as much as 30 dB greater than observed in
listeners with completely normal audiograms. We are hard pressed to account for
this observation in the light of the generally accepted view that there is minimal
sensory cell degeneration in regions of normal behaviour threshold. On the assumption
that a near-normal complement of sensory cells exists in regions of normal threshold,
we are forced to conclude that the physiological insult is more subtle than has here
tofore been realized. Investigations of the physiological correlates of the pronounced
masking effect observed in regions of normal threshold in the traumatized ear pre
sent an intriguing and challenging opportunity, one that has relevance not only to
a fundamental understanding of the mechanism of auditory masking, but also to
understanding the speech handicap experienced by listeners with selective high
frequency hearing loss.
Although we cannot offer speculation about the mechanism underlying the pronounced
spread of masking in ears with cochlear pathology, the empirical finding may offer
insight into the basis of the speech communication problem characterizing listeners
with sensorineural impairment.
In earlier attempts to account for the communication loss reported for listeners
with selective high-frequency hearing damage, the threshold dip has been regarded
as selective attenuation of high-frequency information. This, together with the
limited dynamic range (i.e. recruitment) in the region of threshold elevation
has been held responsible for the loss of speech discrimination in noise. This view
is consistent with the stated need for selective amplification and automatic gain
control in hearing aids.
Quite a different explanation of the speech deficit is suggested by the masking
patterns obtained with impaired listeners. As a working hypothesis we suggest that
the pronounced spread of masking into the high-frequency regions, including areas
of normal threshold, attendant on the threshold dip, is a major causative agent for
the listener's communication loss in noise. In order to test this notion, a simple
demonstration experiment was conducted. A typical high-frequency hearing loss was
simulated in a normal listener by presenting a 3000 Hz continuous tone along with
the discourse. As before, the task of the listener was to adjust the level of the
noise until the speech just became intelligible. Addition of the tone was found
to have no effect on speech intelligibility. Thus, we conclude that the loss of
high frequency information in the speech waveform is due to threshold elevation and
that increased recruitment plays only a secondary role in determining the quality
of speech perception. It is clear that detailed measurements of the frequency se
lectivity and auditory nonlinearities attendant on hearing impairment are needed
before we can reach a more fundamental understanding of the hearing disorder.
Assessment of hearing handicap
The hearing handicap, as argued, can be usefully expressed in terms of the difficulty
of hearing and communicating in an everyday situation of ambient noise. A quantitative
index of the individual's performance in such a situation is his speech
intelligibility threshold determined against a background of noise (expressed as
signal-to-noise ratio). Insight into the value of this index is gained by analyzing
the everyday situation of two competing sound sources, one a primary talker and the
other unwanted interference from another speaker or TV, etc. Plomp (1976) has shown
that, under ordinary room acoustics and placement of the two sources, the primary
sound is about 5 dB above the comfort level for understanding of conversation by the
normal-hearing listener. We have seen earlier that listeners with selective high
frequency hearing loss have masked thresholds between 5 and 15 dB higher than the
normal control group. Thus the prediction is that almost all our non-normal experi
mental listeners will have serious, if not insurmountable, difficulties in under
standing the speaker in noise.
The above hearing handicap can be estimated directly by measuring the listener's
masked speech-intelligibility threshold. However, the apparent relationship between
speech discrimination in noise and the prominence of the upward spread of masking
suggests that it may well be possible to predict the handicap from audiometric
measurements of masked threshold, thereby bypassing the difficulties inherent in
speech-intelligibility measurements.
Using the averaged masked audiogram obtained for the 75 dB, 1000 Hz masker for
normal observers as the baseline, we can quantify the degree of the upward spread
of masking in individual listeners by averaging the elevation of masked threshold
at 1500, 2000 and 3000 Hz above the normal masked thresholds. The averaged masked
threshold elevation, which we shall call the Handicap Index (HI), in conjunction
with a simple decision rule, can now be used to predict speech intelligibility.
Fig. 5 is a scattergram depicting masked speech-intelligibility threshold, averaged
across speech levels of 65 - 95 dB, plotted against HI for all the listeners partaking
in the experiment. If we accept for a moment that a 5 dB elevation of intelligibility
threshold gives rise to a significant hearing handicap, we can evaluate various de
cisions. To illustrate the approach, assume that we adopt a rule whereby all listeners
having an HI greater than 10 dB are classified as handicapped. From Fig. 5 we observe
that the probability of correctly identifying a handicapped individual is close to
unity. Unfortunately, a few individuals are incorrectly classified as having a handi
cap. Nevertheless, the outcome is deemed quite respectable, especially when it is
realized that according to the Average Hearing Level measured in the quiet all listeners
would be considered normal. Assessment of hearing handicap with the HI approach, while
very promising, must be substantiated in many additional tests before we can
recommend the procedure as a clinical diagnostic tool.
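By way of illustration, the Handicap Index and the decision rule can be written out in a few lines (a sketch only; the 1500/2000/3000 Hz frequencies, the 75-dB, 1000-Hz masker baseline and the 10-dB criterion come from the text, whereas the function names and the numerical thresholds in the example are hypothetical):

    # Frequencies (Hz) over which the masked-threshold elevation is averaged
    HI_FREQUENCIES = (1500, 2000, 3000)

    def handicap_index(masked_thresholds_dB, normal_masked_thresholds_dB):
        """Average elevation of the listener's masked thresholds (75-dB,
        1000-Hz masker) above the averaged normal masked audiogram, at
        1500, 2000 and 3000 Hz."""
        elevations = [masked_thresholds_dB[f] - normal_masked_thresholds_dB[f]
                      for f in HI_FREQUENCIES]
        return sum(elevations) / len(elevations)

    def classified_as_handicapped(hi_dB, criterion_dB=10.0):
        """Decision rule discussed in the text: HI above 10 dB -> handicapped."""
        return hi_dB > criterion_dB

    # Hypothetical example: masked thresholds in dB SPL for one impaired listener
    normal   = {1500: 52.0, 2000: 40.0, 3000: 30.0}
    listener = {1500: 70.0, 2000: 62.0, 3000: 48.0}
    hi = handicap_index(listener, normal)
    print(round(hi, 1), classified_as_handicapped(hi))   # -> 19.3 True
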
Fig. 5. Scattergram of Handicap Index (i.e., the average "supranormal" upward spread
of masking obtained at 1500, 2000 and 3000 Hz) against masked speech-intelligibility
threshold. Symbols distinguish the listener classes (normal, old normal, trauma,
presbyacusis, drugs, inherited).
Possibilities for rehabilitation
Practically all modern hearing aids are designed to produce either selective or broad-
band amplification of sound. Since wanted speech signal and unwanted background
undergo equal amplification, the signal-to-noise ratio is at best unaltered (in
practice it decreases somewhat due to noise in the instruments). This implies that
listeners with a flat speech intelligibility loss (all our non-normal listeners)
benefit from the traditional hearing aid only in a situation where background is
negligible to start with (Plomp, 1976).
If we are to significantly improve the speech-reception capabilities of our typical
impaired listener, the signal-to-noise ratio must be increased by about 10 dB.
A straightforward approach to solving the hearing dilemma entails detaching the
microphone from the hearing aid. By giving the microphone to the speaker, a consi
derable improvement in the S/N can be realised. From the inverse-square law of
physics we know that in the direct sound field acoustic signal strength decreases
6 dB per doubling of distance. At a comfortable speaker-to-listener distance of
two metres, the loss of signal strength is about 10 dB. Since ambient noise in a
room is about the same level at all locations, bringing the microphone near the
speaker improves the S/N by 10 dB. This 10 dB improvement in the S/N provides the
margin between intelligible and unintelligible speech and is sufficient to enable
most hearing-impaired listeners to understand speech about as well as the unaided
normal individual.
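The figure of roughly 10 dB follows directly from the free-field distance law; a small worked sketch (the 0.6-m microphone distance is our illustrative assumption, not a value given in the text):

    import math

    def level_change_dB(d_listener_m, d_microphone_m):
        """Free-field (inverse-square) gain in speech level obtained by moving
        the pick-up point from the listener's position to a close-talking
        microphone; ambient noise is assumed equally loud everywhere, so this
        is also the S/N improvement."""
        return 20.0 * math.log10(d_listener_m / d_microphone_m)

    # 2 m speaker-to-listener distance; a lapel microphone at roughly 0.6 m
    # (an assumed, illustrative value) recovers about 10 dB of S/N.
    print(round(level_change_dB(2.0, 0.6), 1))   # -> 10.5
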
Present research efforts of the author and others at the Institute for Perception
Research are now being directed at developing a detachable microphone system that
not only delivers the required S/N, but is also ergonomically acceptable to the
user. The detachable microphone system incorporated in the personal hearing aid has
obvious shortcomings; none the less, the concept has received great acceptance as an
auditory trainer in schools for the hard-of-hearing. Borrowing from the technology
of light transmission of TV audio developed by Sennheiser, a prototype for the de
tachable microphone system utilising infrared light transmission of the acoustical
signal has been developed. The system consists of two components: a transmitter,
which includes the microphone; and a receiver comprising an infrared light receiver,
an amplifier and an earphone. A detailed report of the initial evaluation of the
prototype is beyond the scope of this report. However, it can be stated that we are
most encouraged both by the improvements in speech intelligibility rendered by the
device and by the enthusiastic endorsement its users have voiced.
Conclusion: toward a reevaluation of the medico-legal standards of hearing handicap.
It should be abundantly clear that the basic message of the present work is that
evaluation of hearing capacities must take place under realistic conditions of
moderate background noise. That listeners with audiological evaluations of "normal"
hearing based on pure-tone thresholds experience serious difficulties in understanding
speech in noise can hardly be unknown to the practicing audiologist. What is pain
fully obvious is that existing procedures for assessing the hearing handicap still
emphasize measurements taken in quiet and therefore do not capture the daily auditory
experiences of the impaired listener. Hearing handicaps are often subtle: they are
manifested most acutely in noise and are undetectable using standard audiological
techniques. Likewise, legal standards of damage-risk criteria for environmental noise
give carte blanche to insulting agents that do not intrude in the region of the
speech frequencies. In other words, high-frequency hearing is legally expendable.
The very profound deficits we have uncovered for listeners with exclusively high
frequency hearing loss clearly demonstrate how ludicrous is the prevailing view of
hearing handicap.
In conclusion, perhaps the major problem in the impaired-hearing field is to reach
a consensus as to what constitutes hearing handicap. Having established an acceptable
standard for evaluation of auditory capacities we can then proceed with research
directed at developing effective prosthetic devices. And, perhaps more important
still, we can begin devising legal standards governing "safe" levels of environmental
irritants.
"An ounce of prevention is worth a pound of cure:"
Summary
In this paper experiments are described to establish empirical evidence for the
audiological observation that listeners with normal pure-tone thresholds below 2000 Hz
and selective high-frequency sensorineural hearing loss often experience great
difficulty perceiving speech in a noise background. For patients with either noise
trauma or presbyacusis, masked-speech intelligibility thresholds (S/N) were about
10 dB higher than for normal observers. In an effort to provide a psycho-acoustical
explanation for the speech communication deficit, pure-tone masking patterns were
measured. Relative to the normal control group, listeners with high-frequency
hearing loss showed as much as 30 dB more upward spread of masking, often in fre
quency regions of normal pure-tone threshold. The strong positive relationship be
tween the masked-speech intelligibility threshold and the upward spread of masking
suggests that it may be possible to predict the patient's speech perception handicap
in noise from audiometric measurements of masked threshold. Implications of the pre
sent work for development of close-talking-microphone hearing aids are indicated.
References
de Boer, E. and Bouwmeester, J. (1974) Critical bands and sensorineural hearing loss,
    Audiology, 11, p. 236-259.
Courtois, J. (1975) Binaural IROS fitting of hearing aids. Scandinavian Audiology,
    suppl. 5, p. 194-230.
Leshowitz, B.H. and Lindstrom, R. (1977) Measurement of nonlinearities in listeners
    with sensorineural hearing loss, In: E.F. Evans and J.P. Wilson (Eds.)
    Psychophysics and Physiology of Hearing, Academic Press, London, p. 283-292.
Nelson, D.A. and Bilger, R.C. (1974) Pure-tone octave masking in listeners with
    sensorineural hearing loss, J. Speech and Hearing Research, 12, p. 252-269.
Plomp, R. (1976) Binaural and monaural speech intelligibility of connected discourse
    in reverberation as a function of azimuth of a single competing sound source
    (speech or noise), Acustica, 34, p. 200-211.
Plomp, R. (1978) Auditory handicap of hearing impairment and limited benefit of
    hearing aids, submitted for publication.
Wegel, R.L. and Lane, C.E. (1924) The auditory masking of one pure tone by another
    and its probable relation to the dynamics of the inner ear, Phys. Rev., 23,
    p. 266-285.
Further psychophysical data on two-tone suppression
H. Duifhuis, J. Smits, J. v.d. Vorst and M. Scheffers
Introduction
We have previously reported theoretical (Duifhuis, 1976) as well as experimental re
sults on two-tone suppression (Duifhuis, 1977). So far the experimental psychophysi
cal data were limited to two-tone suppression for a suppressee frequency of 1 kHz.
Data have been obtained exclusively with the pulsation threshold technique.
In this paper the following additional results are presented: (1) a comparison between
two-tone suppression in pulsation threshold and in forward masking; (2) two-tone
suppression data at suppressee frequencies other than 1 kHz; (3) the effect of a white
noise background on the suppression effect. These additional data are considered
necessary not only for a better understanding of the general agreements and significant
differences established to date between theoretical predictions and experimental data,
but also for a better understanding of the suppression mechanism.
Two-tone suppression data will be presented here exclusively in the Fig. 1 format.
For fixed suppressee frequency (f_S) and suppressor (or masker) frequency (f_M) it
shows the effect of suppressor level (L_M) on "activity in the suppressee channel" as
monitored by the probe threshold (L_P), where f_P = f_S. Suppressee level L_S is a
parameter. Suppressee S and suppressor M are presented simultaneously, the probe
signal P separately. Part (a) of the curve characterises the situation where the
suppressor is ineffective because L_M is too low. Branch (b) shows increase in
suppression (decrease in response) with increasing L_M. Branch (c) of the suppression
curve represents the situation where the suppressor is so much stronger than the
suppressee that L_P reflects response to the suppressor (M) instead of the suppressee
(S). The depth D of the suppression notch is a quantitative measure of the suppression
effect.

Fig. 1. Schematic result of a two-tone suppression experiment. The ordinate gives the
response L_P to a fixed suppressee (L_S) as a function of suppressor level (L_M). The
curve is typical for f_M < f_P.
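The depth D can be extracted numerically from a measured suppression curve; a minimal sketch (our own illustration, not part of the original analysis, with hypothetical data) could be:

    def suppression_depth(L_M, L_P):
        """Depth D of the suppression notch, in dB.

        L_M: suppressor levels (dB SPL); L_P: corresponding probe thresholds.
        Branch (a) of the curve is approximated by the response at the lowest
        suppressor levels; D is the drop from that plateau to the minimum of
        the notch."""
        plateau = sum(L_P[:3]) / 3.0        # average over branch (a)
        return plateau - min(L_P)

    # Hypothetical curve: plateau near 40 dB, notch down to 28 dB -> D = 12 dB
    levels = [50, 55, 60, 65, 70, 75, 80, 85, 90]
    probe  = [40, 40, 40, 38, 33, 28, 30, 36, 44]
    print(suppression_depth(levels, probe))   # -> 12.0
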
Pulsation threshold vs forward masking
The comparison between the pulsation threshold and forward masking method in the
two-tone suppression case was primarily motivated by the considerable day-to-day
variability which was found in parts (b) and (c) of the suppression curves (Fig. 1).
Within subjects, threshold settings cover ranges up to 15 dB. Fortunately, the sup-
pression effect measured was often twice as large. Nevertheless we wondered whether
the variability was inherent in the pulsation threshold method, and whether more
stable results could be obtained by another method. Subjects did report some dif
ficulties in keeping the criterion in the adjustment task constant. Therefore we
decided to compare with results from a two-interval, two-alternative forced-choice
experiment. Since this paradigm cannot be used for determining a pulsation threshold,
it was applied in a forward masking experiment. This choice also helps to relate our
pulsation threshold results, e.g., to Shannon's (1976) forward masking data on two-
tone suppression.
Our implementation of the pulsation threshold technique has been described in
Duifhuis (1977). In the forward masking experiment we used masker durations of 400 ms
separated by an 800-ms silent interval. Probe duration was 20 ms at half amplitude,
and all signals had cosine-shaped ramps of 20 ms (Fig. 2). Probe onset started
immediately at the end of the masker offset ramp. The subject used the sequential
up-and-down strategy described by Cardozo (1966). Typically, 40 to 100 trials were
presented for the determination of one 75% correct threshold. Measurements were
repeated over the course of several weeks. Figures 3 and 4 show results of subject JS
for pulsation threshold and forward masking, respectively. In both cases
f_S = f_P = 1 kHz, f_M = 400 Hz, and L_S = 45 dB SPL were used. In one session the
subject measured a series of points by one method and the same series again by the
other method. Data from one session are represented by a single symbol in the two
figures.

Figure 3 shows results similar to those obtained previously. For L_M < 70 dB the
estimated standard deviation s = 1 dB (branch a). With increasing L_M the values of s
increase to about 5 dB.

Fig. 2. Time course of the forward masking stimulus. Ramps are cosine-shaped.
(M: suppressor, S: suppressee, P: probe.)

Fig. 3. Ten series of pulsation threshold data (two-tone suppression experiment)
obtained over a 2-month period. Parameters: f_M = 400 Hz; f_P = 1 kHz; L_S = 45 dB SPL.

Fig. 4. Data of the forward masking experiment corresponding to Fig. 3.
Figure 4 gives the corresponding forward masking results. Note that the variability
here is independent of L_M; over the entire range we find s of about 5 dB. At
L_M = 85 dB we found a significant correlation (r = 0.8) between forward masking and
pulsation thresholds obtained within a single session. Lack of correlation at other
levels, due partly to the fact that variability in pulsation thresholds was too small,
prevents far-reaching conclusions from being made. At any rate, the forced-choice,
forward-masking method did not produce results that were superior to pulsation
threshold results insofar as day-to-day variability was concerned. Finally, it may be
worthwhile mentioning that further measurements indicated that variability could be
reduced by shortening measurement series to sessions of, at most, 15 min.
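The sequential up-and-down strategy mentioned above can be illustrated with a generic transformed up-down staircase for a two-interval forced-choice task (a sketch only: the 2-down/1-up rule shown tracks roughly the 71-75% correct region but is not necessarily Cardozo's (1966) exact rule, and the listener model prob_correct is hypothetical):

    import random

    def two_afc_staircase(prob_correct, start_dB=50.0, step_dB=2.0, n_trials=80):
        """Generic transformed up-down (2-down/1-up) staircase for a
        two-interval forced-choice task: the probe level is lowered after two
        consecutive correct responses and raised after every error, so the
        track hovers around the 71-75% correct region."""
        level = start_dB
        run = 0                  # consecutive correct responses
        last_move = 0            # -1 down, +1 up, 0 not yet moved
        reversals = []
        for _ in range(n_trials):
            if random.random() < prob_correct(level):
                run += 1
                if run < 2:
                    continue     # wait for the second correct response
                move = -1        # two correct in a row: lower the level
            else:
                move = +1        # error: raise the level
            run = 0
            if last_move and move != last_move:
                reversals.append(level)
            last_move = move
            level += move * step_dB
        return sum(reversals[-6:]) / 6 if len(reversals) >= 6 else level
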
Two-tone suppression at other frequencies
Additional two-tone suppression data were obtained with the pulsation threshold
technique. The experimental set-up was essentially the same as the one previously
used, except that available equipment now produces trapezoidal envelopes instead of
envelopes with cosine-shaped ramps. Onset and offset were adjusted to 25 ms.

Some results are shown in Fig. 5 for f_s = 0.5, 1, 2 and 4 kHz. Qualitatively the
results are very similar for all suppressor frequencies. Two-tone suppression depends
strongly on the ratio f_m/f_s, but weakly on absolute frequency. A close look at the
data might suggest a decrease in suppression at higher frequencies. The results shown
here are consistent with comparable data (Abbas and Sachs, 1976; Shannon, 1976) and
theoretical expectations.

Fig. 5. Two-tone suppression data at several suppressee frequencies f_s. In part A are
shown examples for f_m < f_p, in part B for f_m > f_p (legend: f_p/f_m = 500/200,
1000/600, 2000/1200 and 4000/2400 Hz in part A; 500/600 and 2000/2400 Hz in part B).
All data shown in this figure used L_s = 60 dB SPL.
Two-tone suppression against a continuous noise background
The effect of a continuous noise background on two-tone suppression was investigated
in a number of conditions where a relatively strong suppression effect had been
measured. The only change in the experimental set-up consisted in the addition of a
continuous white noise signal to the headphone. The noise was presented at 3 spec
tral densities differing by 10 dB.
Fig. 6. Two-tone suppression data at f_s = 2 kHz, f_m = 0.8 kHz and L_s = 50 dB SPL,
for several continuous white noise backgrounds (no noise, -2, 8 and 18 dB/Hz). The
parameter is the spectral density of the noise in dB/Hz.
Fig. 6 shows an example of the results. The major effect of the noise is to fill up
the suppression notch and decrease the suppression effect. A 20 dB increase in
spectral density level N_0 suffices to reduce suppression from maximum to zero effect.
Reduction of suppression by noise addition can be interpreted in terms of wide-band
noise acting as a suppressor (cf. Houtgast, 1974; Duifhuis and Simons, 1976;
Leshowitz and Lindstrom, 1977). The background noise suppresses the suppressee, and if
maximum suppression is obtained in this way then addition of the tonal suppressor
cannot amplify the suppression effect. Since probe and suppressee are affected in the
same way, the suppression by the noise background does not show up in a downward shift
of branch (a) in the suppression curve (Fig. 1). A secondary effect is that the noise
background also suppresses the masker. This effect by itself would cause a local shift
of the suppression curve along the L_M-axis. The effect depends on the L_M/N_0 ratio
and will disappear at sufficiently high masker levels. The suppression curves for
N_0 = -2 dB/Hz and N_0 = 8 dB/Hz provide some evidence for this effect (levels in SPL).
The present results support the assumption that a simultaneous wideband noise masker
is an effective suppressor.
Conclusion
Day-to-day variability in pulsation threshold is of the same size as in forward masking
using a two-interval, two-alternative forced-choice method.
Psychophysical two-tone suppression depends predominantly on the frequency ratio of
suppressee to suppressor, given the amplitudes.
In a background of continuous white noise the two-tone suppression effect seemingly
disappears.
Qualitatively, these findings are in agreement with our theoretical expectations. A
quantitative description of all data requires more theoretical and experimental work.
References
Abbas, P.J. and Sachs, M.B. (1976) Two-tone suppression in auditory-nerve fibers:
    Extension of a stimulus-response relationship, J. Acoust. Soc. Amer., 59,
    p. 112-122.
Cardozo, B.L. (1966) A sequential up-and-down method, I.P.O. Annual Progress Report,
    1, p. 110-114.
Duifhuis, H. (1976) Cochlear nonlinearity and second filter. Possible mechanism and
    implications, J. Acoust. Soc. Amer., 59, p. 408-423.
Duifhuis, H. and Simons, W.F. (1976) Theoretical responses of the "hair-cell BPNL"
    model to bands of noise, I.P.O. Annual Progress Report, 11, p. 2-9.
Duifhuis, H. (1977) Cochlear nonlinearity and second filter. A psychophysical
    evaluation, In: E.F. Evans and J.P. Wilson (Eds.), Psychophysics and Physiology
    of Hearing, Academic Press, London, p. 153-163.
Houtgast, T. (1974) Lateral suppression in hearing. Doctoral thesis, Free University,
    Amsterdam.
Leshowitz, B. and Lindstrom, R. (1977) Measurement of nonlinearities in listeners with
    sensorineural hearing loss, In: E.F. Evans and J.P. Wilson (Eds.), Psychophysics
    and Physiology of Hearing, Academic Press, London, p. 283-292.
Shannon, R.V. (1976) Two-tone unmasking and suppression in a forward-masking situation,
    J. Acoust. Soc. Amer., 59, p. 1460-1470.
Preliminary experiments on accent perception in tone sequences
J. Thomassen
Introduction
When listening to a sequence of tones, some tones seem to be more prominent than
others and are said to have accent. This description, in spite of its vagueness,
can be turned into an operational definition of accent. Accent, then, is to be
considered as a concept in the perceptual domain. Note that it can be described
without making use of physical properties of the tone sequence.
In the physical domain, certain factors may elicit the perception of accent. The
mapping of the physical domain onto the perceptual domain is the main purpose of the
investigation, which started just over a year ago. It will be useful to introduce
separate terms in the physical domain. Hence accentuation is introduced as a physical property of a sound which gives the impression of being an accent.
Restricting ourselves to pure tones, three physical properties stand out.
1. A tone that has a higher loudness than its neighbours is said to have dynamic accentuation.
2. Temporal accentuation is understood to be the set of operations in the time
domain that result in the perception of accent.
3. Melodic accentuation is the accentuation given by the succession of frequency
intervals of the sequence.
The three accentuations will, of course, interact and non-physical factors like
memory, expectation etc. are known to affect accent(-perception) as well.
An operational definition of accent is closely related to a method of measurement. The first problem to be solved is thus to find a reliable and efficient method of measurement that does justice to the common notion of the concept of accent. Apart from that,
it must be ensured that the definition is not dependent on the particular method used
and therefore a number of methods has to be compared. It should then turn out that
in a given sequence the same tones are considered as accents with the various methods.
The methods can be profitably tested using stimuli with dynamic accentuation only,
as the mapping accentuation-accent seems rather obvious in that case.
Having found (a) method(s) of measurement that can be extended to the other parameters, a measuring programme will be carried out by one particular method that has been established as most suitable. In this programme it is hoped to find a relationship between accent perception in music and accent perception in speech. The most powerful cue to the perception of accent in speech appears to be a rapid pitch rise, pitch fall or combination of the two ('t Hart and Cohen, 1973). It has also been shown that the timing of pitch movement with respect to vowel onset and end is particularly important, as is its position in the overall pitch contour (Van Katwijk, 1974). Therefore the investigations will be mainly directed towards the contribution of melodic accentuation to the perception of accent in music.
Experiment 1
The simple method we started with involves indicating the position and strength of perceived accentuation in a sequence of tones on a response form.
The stimuli were short isochronous sequences of N (N=4, 5 or 6) 1 kHz tones (55 dB SL) with accentuation produced only by small differences in sound level. In each sequence no more than one tone was accentuated. The position (n, n=1, 2 ... N) of the accentuated tone and the strength of the accentuation (S, S=1, 2, 3 or 4 dB) were varied over all possibilities.
The tone sequences were generated on the MARIE set-up (Moonen, 1975) connected
to a P9202 mini-computer and a HP3320B frequency synthesiser. The recorded stimuli
were presented to 10 subjects diotically, using headphones in a sound-proof booth.
Averaging the number of incorrectly localised accentuations, as well as the number of accentuations shifted one position, and dividing by the number of stimuli, we obtain the quantities C(S,N) and C1(S,N). Combining the results over N=4, 5, 6 and averaging over the 10 participating subjects gives Fig. 1, in which C(S,N) and C1(S,N) are plotted with their standard deviations.
A simple model may clarify the observed data. Having a detection probability of accentuation, D(S), we are left with the probability of making a localisation error in case of detection of accentuation, E(N), and eventually the probability of shifting one position, E1(N), averaged over all positions in the sequence. For a gambling subject (stimulus not detected) the last two probabilities reduce to P(N) = (N-2)/N and P1(N) = 2(N-1)/N^2, respectively. We can now write down the following equations for the probabilities:

    C(S,N)  = D(S)·E(N)  + (1 - D(S))·P(N)      (1)
    C1(S,N) = D(S)·E1(N) + (1 - D(S))·P1(N)     (2)

Putting E(N) ≈ E1(N) (most localisation errors will be one position out), D(S) is obtained by subtracting (2) from (1). An estimate of E1(N) can be made by taking C1(4,N) ≈ E1(N), because we have D(S) ≈ 1 for S = 4 dB.
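The subtraction can be made concrete in a few lines of code. The following sketch (in Python; the numerical example is invented, not data from the experiment) recovers D(S) from measured fractions C and C1 under the approximation E(N) ≈ E1(N):

    # Subtracting (2) from (1) with E(N) ~ E1(N) eliminates the error term:
    #   C - C1 = (1 - D) * (P - P1)   =>   D = 1 - (C - C1) / (P - P1)

    def chance_levels(N):
        """Chance probabilities for a gambling subject in a sequence of N tones."""
        P = (N - 2) / N             # incorrect localisation, as in the text
        P1 = 2 * (N - 1) / N**2     # localisation exactly one position out
        return P, P1

    def detection_probability(C, C1, N):
        """Estimate D(S) from the measured fractions C(S,N) and C1(S,N)."""
        P, P1 = chance_levels(N)
        return 1.0 - (C - C1) / (P - P1)

    # Invented data point: N = 5 tones, C = 0.30, C1 = 0.20.
    print(detection_probability(0.30, 0.20, 5))   # ~0.64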
The fitting of dotted lines and data points confirms that the approximation E(N) ≈ E1(N) was a fairly good one. In Fig. 1 the dotted lines connect the points C(S,N) and C1(S,N) that have been recalculated by substituting the values of D(S) and E1(N) in equations (1) and (2).
We see that with a level difference of 2 dB the 10 subjects could already achieve more than 80% correct responses in localising the accentuated tone. However, some subjects still made errors E(N) at a level difference of 4 dB. The incorrect responses were mostly one position out (E(N) ≈ E1(N)) and were thus termed "counting errors".
Fig. 1. Fraction of incorrectly localised accentuations C and the fraction of accentuations localised one position out C1 as a function of strength of accentuation S; P and P1 are the corresponding chance levels. Average over 10 subjects.
errors". Re-inspection of the data showed that certain subjects were responsible
for the "counting errors". In future such subjects may be discarded through adoption
of an E1(N)- selection criterion.
So far this simple method has been found suitable for measuring accent perception, apart from some efficiency improvements to be made. However, a drawback could be
the tendency of subjects to respond at the first tone of the short sequence in the
case of slight accentuation. Working with short fragments of melodies ("motifs"),
where ambiguity is perhaps the rule rather than the exception, this might distort
the possible outcome of the experiments. A solution to this problem could be the
insertion of the motif in a context, as in the next experiment.
Experiment 2
This experiment was arrived at from the following considerations. The accentuation present in the context in which a motif is embedded influences the accent perception within the motif. Here we restrict ourselves to contexts with a periodic accent structure or metre. A periodic accent structure, once established, tends to be continued in the mind of the listener: forthcoming accents are anticipated at distances determined by the period. This is independent of the real occurrence of accents in the anticipated material. Suppose a motif is embedded in a context with periodic accentuation. The anticipated accent of the motif is thus forced into a position determined by the period of the context. This can be done successively for all positions in the motif, and each time the corresponding distribution of accent responses over the motif tones can be measured. By mutual comparison of these distributions we are able to decide which motif tone lends itself best to accent apart from anticipation; next we can infer from the physical parameters of all the motif tones the (combination of) physical factors which can be conceived as accentuation.
The distributions of accent responses are measured indirectly by asking to what extent the expected accent coincides with the perceived accent in the motif. As a criterion for judgement we take the answer to: "How well does the accentuation of the sequence continue in the motif?"
It is assumed that a subject can handle this criterion well in a situation in which he compares a pair of tone sequences and indicates which sequence best meets the criterion.
This task was allocated to 10 subjects. In addition, 6 of them marked how well each tone sequence of a pair met the criterion on a ten-point scale (10 corresponding to the optimum). The stimuli were isochronous sequences of 1 kHz tones, the difference in sound level between accentuated and unaccentuated tones being 4 dB. The sequences used are shown in Fig. 2.
The motif tones are indicated by tildes, the arrows pointing to the anticipated accents. There were two conditions: in the first the motif tones could be recognized by their smaller tone durations, in the second the motif tones and the context tones could not be distinguished. All possible pairs of the five sequences were constructed and presented four times in each condition.
Fig. 2. Different ways of embedding a three-tone motif in a context with triple or duple metre. The sequences are isochronous, the overall tone frequency is 1 kHz. The motif tones are indicated by tildes; the arrows point at the anticipated accents.
Fig. 3. Agreement of perception of accent in context and motif for the sequences in Fig. 2. Results for scaling (~) and absolute pair comparison (0).
The order within the pairs was balanced and the presentation of the pairs was random. The stimuli were generated with the same set-up as in experiment 1. The data showed no difference between the 2 conditions and no effect of order within the pairs. The scaling results averaged over the 6 subjects and 2 conditions (~) are plotted in Fig. 3, together with the results of the absolute pair comparison (averaged over the 10 subjects and 2 conditions) transformed to the same scale (0).
The sequence A is scaled significantly higher than the sequences C, E and B. This was to be expected, because in A the accentuation coincides with the anticipated accent. The sequence D is scaled somewhat lower than A, because there another accent is expected in a position where there is no accentuation. The scaling results do not differ very much from the results of absolute pair comparison, apart from the small but important difference that with scaling a clearer distinction can be made between sequence D and the low-rated sequences. For this reason scaling is perhaps to be preferred to absolute pair comparison, the more so as scaling can be applied quite apart from a pair comparison task, which may save a considerable amount of time.
Used in this way the method can
provide the possibility of measuring
accent because it yields unequivocally
the cases of coinciding accentuation
and anticipated accent.
At the same time two drawbacks of the method used in the first experiment are removed. The first is the bias towards the first tone (to be considered as a limiting case of anticipation, viz. the case of no context). The second is a direct accent response on the part of the subject: it seems more favourable for the subject to respond indirectly, i.e. by interpreting the sequence rhythmically, instead of applying explicit criteria as to what tones are to be considered as accents.
Summary
Two preliminary experiments, which may become a bridgehead in the measurement of accent perception in tone sequences, have been described. A third method of measurement - viz. a method using tap responses - has been left open. The methods discussed so far have yielded no contradictions with respect to the demands we have made on the implementation of an operational definition of accent. This is not very surprising, for dynamic accentuation leaves little room for such contradictions. However, there seems to be reason enough to assume the feasibility of measuring accent perception in tone sequences with arbitrary accentuation parameters.
References
Hart, J. 't and Cohen, A. (1973) Intonation by rule: a perceptual quest, Journal of Phonetics, 1, p. 309-327.
Katwijk, A.F.V. van (1974) Accentuation in Dutch, Van Gorcum B.V., Assen, Holland.
Moonen, G.J. (1975) MARIE, een Modulaire Aanpassing tussen een Rekenmachine Interface en een Experiment, IPO Report no. 267.
Thomassen, J. (1976) Waarneming van geringe dynamische accentuering in toonreeksen, IPO Report no. 300.
Estimation of annoyance due to low-level sound
B.L. Cardozo and K.G. van der Veen
Introduction
Loudness, noisiness and annoyance are subjective attributes of sound. Loudness, i.e. the subjective correlate of sound intensity, can be assessed in a small number of subjects because there is, normally, fair agreement among them. Their responses can be accurately predicted on the basis of physical measurement (cf. e.g. Zwicker and Feldtkeller, 1967).
For the assessment of noisiness, i.e. the unwantedness of a sound with particular reference to its intensity, many subjects are needed in order to average out personal opinions. Contrary to loudness, there are as yet no general algorithms for predicting the magnitude of noisiness on the basis of physical measurement (cf. Scharf, 1974). The present paper will not deal with loudness or with noisiness but will concentrate on annoyance.
Annoyance is, like beauty for instance, hard to define precisely, but can be considered as the unwantedness of a sound in general. It therefore depends on personal preferences but, in addition, the environment, both acoustic and otherwise, must be taken into account. It is good practice to avoid these problems by presenting a limited set of not too different sounds to a fair number of listeners who are instructed to give an annoyance rating or something equivalent. Theoretically, one would have to use a rich set of sounds, representative of what the population is normally exposed to: traffic sounds, music, building noise, etc. This set should then be administered to an adequate sample of the population and their reactions in terms of annoyance noted.
Sound character
Assume now that the above theoretical annoyance data are plotted against a great many physical parameters, measured for every sound in the set. We would then construct a multidimensional space, some of the axes representing the sound levels in various frequency bands, others representing their time derivatives, still other axes giving the total duration, an objective physical estimate of pitchiness, etc. The set of sounds is represented as a set of points in this multidimensional "annoyance space". These points will not be distributed randomly over the space. In fact, many investigators have found annoyance to correlate highly with sound level, cf. Botsford (1969). Therefore the number of dimensions of the annoyance space can be reduced by projecting all intensity dimensions onto one new axis, labelled L_A, without seriously affecting the original configuration of points. The choice of the A-weighting factors for this projection instead of more sophisticated weightings is not essential. We are now left with a space of lower dimensionality in which two axes are: annoyance and L_A. We now collapse all other dimensions into one that gives a maximal correlation with annoyance, and that is orthogonal to the annoyance axis and the L_A axis. We propose to call this third axis the sound character. In brief, the concept of "sound character" is introduced as the weighted combination of physical
properties affecting the annoyance of a sound, with the exception of the A-weighted level.
In the above picture two problems have been omitted. First, no mention has been made of the environmental "dimensions" and, secondly, it has been tacitly suggested that the annoyance space is linear. However, restricting the discussion to one type of environment and limiting the set of sounds to a small region in the annoyance space, it seems legitimate to consider a concept of "sound character", even though it is a local one.
Annoyance of low-level sounds
The view is advanced that the contribution of sound character to annoyance is relatively more important at a low level than at a high one. Indeed, at extremely high levels the annoyance is just pain, no matter what the sound character is.
At moderately high levels, e.g. 70-100 dB(A), the literature is not unequivocal on the contribution of sound character to annoyance. There are two kinds of papers. The empirical ones, correlating L_A to community reactions, seem to indicate that L_A does do the job of gauging annoyance (cf. ISO R 1996). This does not, however, disprove that the sound character is important. A second class of papers is based on laboratory experiments and maintains the view that noisiness and annoyance cannot be described adequately by the sound level alone.
Berglund et al. (1976) conclude that certain types of noise (jackhammer) are generally considered more noisy than loud, and the more so the lower the sound level. Otherwise stated, the sound character is important, especially at a low level.
we"UZzed;:~\a::Z r------....,Zc(
Fig. 1. Part of the annoyance space. The drawing is merely meant to illustrate the assumed effect of sound character on the annoyance.
Izumi (1977) investigated amplitude-modulated pink noise with an equivalent level of about 70 dB(A). This stimulus proved to be noisier than loud by the equivalent of up to 10 dB(A).
Klaassen (1971) presented a synthetic, complex, broadband noise of about 55 dB(A) to some hundred listeners. In comparing steady state with 100% amplitude-modulated and with a 6% frequency-modulated version (both with a modulation frequency of 3 Hz), he found the annoyance due to A.M. (F.M.) to be the equivalent of 8 (7) dB.
It is difficult to find studies on annoyance at still lower sound levels. A paper by Viebrock et al. (1975) deals with direct assessments of the loudness of electric clocks that produce 20 to 40 dB(A). The paper is relevant insofar as their subjects comment on what we have termed the "sound character". We therefore think that it is interesting to study annoyance of sounds with a relatively low level in order to see whether factors other than the sound level L_A must be taken into account. Fig. 1 is meant to summarize the conceptual situation.
With the above considerations in mind, a pilot study was made of the sound of the household refrigerator. It is our conviction that annoyance studies should deal with common sounds, known to the subjects.
The refrigerator sound satisfies this condition. It has, moreover, a low level. Finally, although the refrigerator is possibly the most silent of household machines, there is an increasing number of complaints about its noise, probably due to the growing number of open kitchens.
Listening experiment
In order to get an idea of the contribution of the sound character to the annoyance
caused by refrigerator sound, a listening experiment was performed with 15 subjects.
Every subject was presented with 56 pairs of sounds through headphones at a level
of about 50 dB(A). He had to tell whether the first or the second member of the pair
was the more annoying sound. Each sound lasted 3.5 s, there was a pause of 1.5 s
between the members of the pair and each pair was followed by a response pause of
4 s. The 56 pairs covered all possible pairs of 8 test sounds, twins excepted but
reversals of order included.
The subject was instructed to judge the sounds as if he were exposed to them while
relaxing at home. The test sounds are given in table I.
Description of test sound                                   Code   Equivalent level
                                                                   Natural   Equalized
                                                                   dB(A)     dB(A)
Normal sound of refrigerator in CONtinuous operation        CON      39        39
RUMbling version of CON, waxing and waning                  RUM      52        41
Normal ONset of refrigerator                                NON      47        39
Normal OFfset of refrigerator                               NOF      39        40
Processed ONset with "improved" character                   PON      40        40
Processed OFfset with "improved" character                  POF      33        44
White NOIse, reference signal                               NOI      41        40
CON, amplified by about 16 dB, serving to gauge the scale   COA      56        55

Table I.
The test sounds were used in two similar experiments. In the first experiment the sounds were presented at the natural levels (save NOI and COA), but in the second the sound levels were changed so as to make them more or less equal (COA and POF excepted).
Natural levels

        POF   NOI   CON   PON   NOF   RUM   NON   COA
POF     ---    15    23    23    26    27    28    30
NOI      15   ---    25    20    24    28    22    26
CON       7'    5'  ---    15    19    27    28    27
PON       7'   10'   15   ---    19    23    24    23
NOF       4'    6'   11    11   ---    24    23    27
RUM       3'    2'    3'    7'    6'  ---    18    27
NON       2'    8'    2'    6'    7'   12   ---    24
COA       0'    4'    3'    7'    3'    3'    6'  ---
Σ        38    50    82    90   104   144   149   184
a        18    24    39    43    50    69    71    88

Equalized levels

        CON   NOI   RUM   NOF   POF   NON   PON   COA
CON     ---    11    20    17    19    24    26    29
NOI      19   ---    17    17    16    17    18    25
RUM      10'   13   ---    16    15    19    25    30
NOF      13    13    14   ---    13    13    16    29
POF      11    14    15    17   ---    15    13    24
NON       6'   13    11    17    15   ---    14    30
PON       4'   12     5'   14    17    16   ---    28
COA       1'    5'    0'    1'    6'    0'    2'  ---
Σ        64    81    82    99   101   104   114   195
a        30    39    39    47    48    50    54    93

Tables II and III. Voting tables for the experiments with natural (II) and equalized (III) levels. Each entry gives the number of times that the sound above the column was voted more annoying than the sound to the left of the row. Apostrophes indicate ratios significantly different from 15/15 (P<0.05). a is the normalised annoyance measure.
The results are shown as 'voting' tables II and III, in which the sounds have been arranged in order of increasing annoyance. For example, in table II RUM was voted 24 times to be more annoying than NOF, whereas NOF was only 6 times judged to be more annoying than RUM. This ratio is significantly different from 15/15 at a level lower than 1%. The theoretical maximum sum of annoyance votes for one test sound, 210, is used as a divisor to obtain the normalised annoyance a/100. Thus, a ranges from 0 to 100.
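The step from voting table to annoyance measure is simple enough to state as a sketch (in Python; the three-sound matrix is an invented example, not the data of Tables II and III):

    # votes[i][j] = number of times sound j was voted more annoying than sound i.
    # With 15 subjects and both presentation orders, each off-diagonal cell can
    # be at most 30, so in the 8-sound experiment a column can sum to 7 * 30 = 210.

    def annoyance_scores(votes, max_votes_per_pair=30):
        n = len(votes)
        max_sum = (n - 1) * max_votes_per_pair
        sums = [sum(votes[i][j] for i in range(n) if i != j) for j in range(n)]
        a = [round(100 * s / max_sum) for s in sums]
        return sums, a

    # Invented 3-sound example (diagonal unused):
    votes = [[0, 10, 25],
             [5,  0, 20],
             [2,  8,  0]]
    print(annoyance_scores(votes))   # column sums and normalised annoyance a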
Figs. 2 and 3. Annoyance rating a of test sounds as a function of L_A. Linear regression lines a = 2.9·L_A - 75 and a = 3.2·L_A - 84, respectively, are heavy lines. Thin lines with arrows connect similar sounds before and after processing (NON, PON; NOF, POF) or amplification (CON, COA).
Figures 2 and 3 present the 'natural' and the 'equalized' experiments respectively in an L_A-a diagram. Regression lines are drawn that minimize the sum of squared a-deviations. From these figures one can draw the following conclusions.
Results
1. The A-weighted sound level L_A has a preponderant influence upon the annoyance of the test sounds. One dB(A) corresponds to roughly 3 points on the centesimal annoyance scale.
2. As a rule, the continuous sounds turn out to be less annoying than the onsets and offsets of equal level. For the normal, unprocessed sounds the difference corresponds to about 3 dB(A). It is likely that this result is an underestimation, since the effect of startle is probably more important in real life than in the laboratory situation.
3. Processing the onset and offset sounds does improve their character. The average effect is equivalent to 3 dB(A). This may be an underestimation for the reason mentioned above.
A general remark is justified as to the feasibility of this type of experiment. Although the subjects listened with headphones in a soundproofed booth to sounds as short as 3.5 s, they were fairly consistent in their responses. Inconsistencies centred around the white noise, which was considered by some of the subjects to be a very alarming refrigerator sound.
Conclusion: annoyance ratings of a rather low-level sound such as that produced by a refrigerator have been shown to be mainly dependent upon the A-weighted level, but to a minor extent also on the sound character.
Acknowledgement
The assistance of G. Doodeman in preparing the stimulus tapes is gratefully recorded.
Summary
The concept of sound character is introduced as a physical attribute responsible for any systematic differences in annoyance due to different sounds at the same A-weighted sound level. It is thought that this sound character is more important at low sound levels than at high ones. A pilot experiment with refrigerator sounds does indicate a clear, though slight, effect of sound character. Especially the sharp onsets are shown to worsen the character of the sounds in question.
References
Berglund, B., Berglund, U. and Lindvall, U. (1976) Scaling loudness, noisiness and annoyance of community noises, J. Acoust. Soc. Am., 60, p. 1119-1125.
Botsford, J.H. (1969) Using sound levels to gauge human response to noise, Sound and Vibration, 3, p. 16-28.
ISO R 1996, Assessment of noise with respect to community response, 1st ed. 1971. Obtainable through the National Standards Organisation.
Izumi, K. (1977) Two experiments on the perceived noisiness of periodically intermittent sounds, Noise Control Engineering, p. 16-23.
Klaassen, J.A. (1971) Fluctuations of a background noise add to its annoyance, in: P. Zonderland (Ed.), Noise 2000, proc. of congresses 5 and 6 of A.I.C.B., Groningen, Wolters-Noordhoff Press, p. 199-200.
Scharf, B. (1974) Loudness and noisiness - same or different? Internoise 74, Proceedings of the 1974 International Conference on Noise Control Engineering held in Washington D.C., New York: Noise/News, p. 559-564.
Viebrock, W.M., Crocker, M.J. and Cooper, W.A. (1975) Loudness evaluations of electric clock noise, Appl. Acoust., 8, p. 193-201.
Zwicker, E. and Feldtkeller, R. (1967) Das Ohr als Nachrichtenempfänger, 2nd ed., S. Hirzel Verlag, Stuttgart, p. 184-203.
Speech
An experimental system for man-machine communication by means of speech
H.F. Muller, S.G. Nooteboom and L.F. Willems
Introduction
In many communication situations, speech is the fastest, most natural, and most flexible medium for the exchange of factual information between people. For computers this may not be true. It seems reasonable, however, when we consider possible ways of man-computer interaction, to ask whether we can make computers speak and understand speech. Studying this possibility is the more urgent as the number of people dealing with computers in their daily life is rapidly increasing and will soon include not only professionals but also the general public.
Advanced techniques for computer voice read-out of stored information and automatic recognition of spoken commands have been applied in the laboratory to man-machine communication by voice, for instance automatic booking of travel reservations (Flanagan, 1976). More ambitious attempts, aiming at applications in a more distant future and involving automatic recognition and understanding of whole sentences, have been made within the ARPA Speech Understanding Project (Klatt, 1977).
In our institute we have recently set up a research project, the main purpose of
which was to consider the possibilities and problems in communication by voice
between a computer system and many users, given a rather simple and not error-free
word recogniser. In this project we have set ourselves the task of making a limited
computer information service in the Dutch language capable of giving the departure
times of intercity trains from Eindhoven railway station in four different directions.
Potential users would be the 40 or so male employees of our institute. The system
was completed within nine months as had been agreed before starting the project.
For a computer to carry on an informative conversation with a human it has to have such facilities as a speech recogniser, a voice response unit, a data base concerning the topic of conversation, and a set of strategies by which it knows what to do with the incoming information and what to say when.
We estimated that technologically the speech recogniser was the most difficult part to realise. We thought it feasible to build a speech recogniser recognising isolated words from very limited vocabularies, accepting speech from about 40 male speakers, and achieving a reasonable recognition score. From this followed the main philosophy of the project: to restrict, inconspicuously, the messages spoken by the user to the machine to isolated words from very limited vocabularies, by implementing a dialogue structure in the form of carefully chosen questions, in which the system retains the initiative.
Below we will briefly describe the resulting system under the headings of word recogniser, linguistic form of the dialogue, voice output, and system control structure. Each heading will be followed by the names of the members of our institute who contributed part of their time to that component of the system. Finally we will present
some tentative conclusions.
Word recogniser (Muller, Dobek, van Nes)
Requirements on the word recogniser were:
a. easy to build
b. operating in real time
c. suitable for recognition of isolated words from limited vocabularies of 2 to 9
Dutch words
d. accepting speech from about 40 male speakers
e. recognition not error-free
f. bandwidth 250 - 6000 Hz.
Acoustic processing and input
For acoustic processing we use a filter bank of 14 filters, the bandwidths of which correspond to the selectivity of the human ear. Output of the filters, number of zero crossings and total energy are sampled every 10 ms. In the input phase an
algorithm is applied for detection of beginning and ending of the speech signal from
zero crossings and total energy.
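Such an endpoint detector might be sketched as follows; the report does not specify the decision rule, so the thresholds and the simple per-frame test below are our own assumptions:

    import numpy as np

    def detect_endpoints(energy, zero_crossings, e_thresh, z_thresh):
        """Find beginning and end of speech in a sequence of 10 ms frames.

        A frame is taken as speech when its total energy exceeds e_thresh
        or its zero-crossing count exceeds z_thresh (catching weak fricatives).
        In practice the thresholds would be derived from the background noise.
        """
        is_speech = (np.asarray(energy) > e_thresh) | \
                    (np.asarray(zero_crossings) > z_thresh)
        idx = np.flatnonzero(is_speech)
        if idx.size == 0:
            return None
        return idx[0], idx[-1]        # first and last speech frame

    # Invented frame data:
    energy = [0.1, 0.2, 3.0, 5.0, 4.0, 0.3, 0.1]
    zcr    = [5,   6,   20,  30,  40,  8,   4]
    print(detect_endpoints(energy, zcr, e_thresh=1.0, z_thresh=15))   # (2, 4)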
Data reduction
Time normalisation is applied by dividing the total duration of the signal into an
equal number of segments such that the accumulated spectral change within each
segment is equal for all segments. The number of segments depends on the vocabulary
and varies from 6 - 10.
Further data reduction is obtained by reducing the spectral shape of each sample.
This is done by coding the derivative of the spectral envelope with one bit only.
The reduced spectral shapes of all samples within each time-normalisation segment
are then averaged. In this way the spectrum of each such segment is coded in 13 bits.
Finally the total energy and number of zero crossings are averaged per segment, and
the derivatives in the time dimension coded with one bit each. We thus obtain a total
of 15 bits per segment.
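Both reduction steps can be sketched in code. In the sketch below (our reading of the description, not the original implementation) segment boundaries equalise the accumulated spectral change, and each averaged 14-filter spectrum is coded with one sign bit per adjacent filter pair, i.e. 13 bits:

    import numpy as np

    def segment_boundaries(spectra, n_segments):
        """Frame indices splitting a word into n_segments with equal
        accumulated spectral change (spectra: n_frames x 14 filter outputs)."""
        change = np.abs(np.diff(spectra, axis=0)).sum(axis=1)
        cum = np.concatenate([[0.0], np.cumsum(change)])
        bounds = np.searchsorted(cum, np.linspace(0.0, cum[-1], n_segments + 1))
        bounds[-1] = len(spectra)          # include the final frame
        return bounds

    def code_segment(seg):
        """Average the spectra of one segment and code the derivative of the
        spectral envelope with one bit: 13 sign bits for 14 filters."""
        return (np.diff(seg.mean(axis=0)) > 0).astype(np.uint8)

    # Invented word of 40 frames (14 filters), reduced to 8 segments:
    rng = np.random.default_rng(0)
    spectra = rng.random((40, 14))
    b = segment_boundaries(spectra, 8)
    pattern = [code_segment(spectra[b[i]:max(b[i] + 1, b[i + 1])]) for i in range(8)]
    print(b, pattern[0])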
Training and classification
The system is trained with 20 male subjects who spoke each word three times. The
resulting 60 patterns for each word are reduced in number by the condensed nearest
neighbour method (Cover and Hart, 1967; Hart, 1968).
Actual recognition is achieved by nearest-neighbour classification. In each particular
instance of word recognition, classification is in terms of a limited vocabulary
as determined by the structure of the dialogue.
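The condensed nearest-neighbour rule keeps only those training patterns that are needed to classify the whole training set correctly. A compact sketch (in Python; the use of Hamming distance on the 15-bit patterns is our assumption, the report does not name the metric):

    import numpy as np

    def nearest_label(x, store_X, store_y):
        """Nearest-neighbour classification by Hamming distance."""
        d = [np.count_nonzero(x != s) for s in store_X]
        return store_y[int(np.argmin(d))]

    def condense(X, y):
        """Hart's (1968) condensed nearest-neighbour rule: start with one
        pattern and keep adding every training pattern the current store
        misclassifies, until a full pass adds nothing."""
        store_X, store_y = [X[0]], [y[0]]
        changed = True
        while changed:
            changed = False
            for xi, yi in zip(X, y):
                if nearest_label(xi, store_X, store_y) != yi:
                    store_X.append(xi)
                    store_y.append(yi)
                    changed = True
        return store_X, store_y

    # Invented 15-bit patterns for a two-word vocabulary, 60 per word pair:
    rng = np.random.default_rng(1)
    X = [rng.integers(0, 2, 15) for _ in range(60)]
    y = [i % 2 for i in range(60)]
    store_X, store_y = condense(X, y)
    print(len(store_X), "patterns kept out of", len(X))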
The vocabularies used in this system are:
- four direction words
- three part-of-the-day words
- nine day words
- twice nine number names,
and the smallest vocabulary: "ja" and "nee" (yes, no).
Classification leads to three different modes of recognition (except for the yes/no vocabulary):
(1) certain recognition of a word
(2) uncertain recognition of a word
(3) no recognition.
Linguistic form of the dialogue (Leopold, van Katwijk)
Due to its limited recognition power the system could hardly be provided with flexible conversational niceties such as the ability to react meaningfully to a user's comments or questions. It has to retain the initiative, imposing a rigid structure on the dialogue, and using carefully chosen questions such that the user's answer to each question is predictably one of a few (2-9) isolated words, forming the vocabulary appertaining to that question. Both the structure of the dialogue and the wording of the questions have been tried out experimentally.
The questions are directed at a specification of the three parameters needed by the system, viz. direction, day of the week, and hour of the day. This could of course be done by means of three questions only. For reasons of efficiency, and in order to break up the vocabularies of possible answers, more questions were used, even in the case of correct recognitions only.
Some of the properties of the dialogue will become clear when parts of the actual
conversation are discussed.
After a monologue in which the system introduces itself, indicating what it is and
does, and promising departure times of intercity trains from Eindhoven, it enumerates
the four cities related to the directions in which intercity trains are running.
This menu-type enumeration is concluded with "van toepassing voor u is ..." ("applicable for you is ..."), where the user has to fill in the information by uttering one of the city names.
Menu-type questions are not the only ones used in the dialogue.
The next step (after confirmation by the system of the recognised city name) is to
establish the day. This is done by first asking a yes/no-question about the most
likely day of departure, namely: "vandaag?" (today?). This very short question
focuses - by so-called conversational implication (cf. Bunt, this issue) - attention
on the day parameter. The user saying "nee" to "vandaag?" is asked: "welke andere
dag?" (which other day?).
The next step, leading to number names for the hours, has to be made in two, first by asking for the part of the day: "ochtend, middag, avond?" (morning, afternoon, evening?), after which the system asks: "welk uur tussen X en Y?" (which hour between X and Y?), where X and Y are in the twelve-hour system and where the number of possible answers is limited by the durations of mornings, afternoons and evenings. X and Y are 5 and 1 for mornings and evenings, 12 and 6 for afternoons.
On the basis of specifications of direction, day and hour, the system consults its data base and produces as a rule three departure times round the target hour. If the next question: "wilt u meer inlichtingen?" (do you want further information?) is answered with "nee", the system breaks off the dialogue with a cheerful farewell. If the answer is in the affirmative, for reasons of efficiency the system does not proceed immediately to the beginning of a new cycle. Instead it asks: "zelfde richting?" (same direction?) and - depending - "zelfde dag?" (same day?), and acts accordingly by entering the cycle at a point where new information is wanted.
As may be evident from the examples given, the system speaks in a somewhat elliptical
style. This turned out to be an improvement over previous, simulated versions where
more elaborate speech was found to be somewhat irritating.
A special comment must be made about uncertain recognition, wrong recognition and no recognition. With uncertain recognition (determined in the recognition process) the word to hand is presented to the user in a yes/no-question: "gaat het om X?" (does it concern X?). In the case of wrong recognition the system has unfortunately no facility to react to protests on the part of the user, who may have to be reconciled with a trip he never intended to make. If a word is not recognised in three successive trials, the system refers the user to the railway information office.
Voice output (De Jong, Willems)
The output speech used is digitised and stored real speech. Of course this has the disadvantage that essentially no rules can be applied to modify the speech waveform, and therefore it is difficult to obtain acceptable speech by concatenation of units like words, syllables or phonemes. Speech output from digitised real speech is largely constrained to prerecorded whole messages. In our case, due to the occurrence of variable items such as the days of the week and the hours of the day in the output sentences, this would have given a very long list of messages. We therefore opted for a compromise solution in which sentences were prerecorded as wholes, but variable parts of otherwise identical sentences were inserted by means of speech editing (cf. Willems and De Jong, 1974). This especially applied to the specification of hours and minutes. An example of an output message, with prerecorded parts indicated, may be: "(The first intercity train leaves at)+(seven)+(hours)+(twenty)". In order to avoid undesired discontinuities at the fragment boundaries (+'s), all items were spoken in the context in which they had to appear by a trained speaker who took care to keep the intonation of the frame sentence constant. All recordings were made in the same recording session. The messages resulting from assembling the prerecorded sentence fragments sounded perfectly natural.
System control structure (Muller)
The system control structure consists of 3 main parts, a maintenance programme, a
data base, and a control programme.
Maintenance programme
The maintenance programme is an interactive system for implementation and modification
of the data base. It has no further function in the operation of the system.
Data base
The data base contains the structure of the dialogue, the language material for voice
output, the vocabularies for the recogniser, a time table, a calendar, and a clock.
The structure of the dialogue
The structure of the dialogue consists of concatenated trees of questions (put by the system) and answers (given by the user). Each branch of a tree establishes a correspondence between a particular result of the recognition process and a sequel of the dialogue.
The control programme controls the interaction between the machine and the user: it selects a message for voice output and the accompanying vocabulary for the expected responses. Then it starts voice output and opens the voice input. Next, a following message is selected, as laid down in the dialogue structure, on the basis of the received (recognised) answer of the user.
If the information necessary to access the time-table is complete, three departure
times are chosen from it and passed on to the voice output system.
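The concatenated trees can be pictured as nodes that couple a prompt, a vocabulary and a branch table. The fragment below is purely illustrative (names, class layout and the piece of dialogue are ours, not the actual programme):

    # Sketch of a dialogue-tree node: each branch maps a recognition result
    # to a sequel of the dialogue, as described above.

    class Node:
        def __init__(self, prompt, vocabulary, branches):
            self.prompt = prompt          # message for voice output
            self.vocabulary = vocabulary  # words the recogniser may expect here
            self.branches = branches      # recognised word -> next Node (or None)

    # Illustrative fragment: the "today?" yes/no question.
    ask_other_day = Node("welke andere dag?",
                         ["maandag", "dinsdag", "woensdag"],   # etc.
                         {})                                   # branches omitted
    ask_today = Node("vandaag?",
                     ["ja", "nee"],
                     {"ja": None,               # day known, go on to the hour
                      "nee": ask_other_day})

    def step(node, recognised_word):
        """Control-programme step: pick the sequel laid down in the tree."""
        return node.branches.get(recognised_word)

    print(step(ask_today, "nee").prompt)        # -> "welke andere dag?"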
Some tentative conclusions
For the limited goals of this information system the kind of dialogue control structure we used is found to be sufficient to guide users successfully through the dialogue, in most cases where the word recogniser is not functioning too badly. Generally, users of the system have little difficulty in adapting their verbal responses to the requirements of the system. Even so, we feel that for more complex systems we cannot easily extrapolate from our experience with this rather limited ad hoc one, and we would greatly profit from more basic insight into the mechanisms of human dialogues.
Of course, more complex systems will naturally require a larger variety of output messages, and in that case the use of digitised real speech, perfectly satisfactory in the present system, may become impracticable as voice output. For other systems one may think of synthesis from prerecorded and analysed words, morphemes or diphones, depending on the requirements of the system. The acoustic form of such units can then be modified by rule in order to arrive at acceptable speech quality.
As might be expected, the weakest link of the present system is the word recogniser. Besides easily reparable shortcomings, such as the absence of the possibility to erase incorrect, yet certain, classifications, the following are more serious ones, specifically concerning the word recogniser: (1) For many of the speakers, performance is inadequate, to the extent even that they do not always get the desired information. (2) It is not easy to find out why recognition fails. For future work in this line it seems advisable to take a rather different approach, enabling researchers to follow more closely the behaviour of the recogniser in response to different acoustic signals.
Notwithstanding the shortcomings of this, our first attempt to let a computer speak and understand speech, we have shown to our satisfaction that, in principle, it is possible to overcome the difficulties stemming from a very limited and not error-free word recogniser, by using a dialogue control structure that lays constraints on the verbal behaviour of the human participant in a man-machine dialogue.
References
Cover, T.M. and Hart, P.E. (1967) Nearest-neighbour pattern classification, IEEE Trans. IT-13, no. 1, January 1967, p. 21-27.
Flanagan, J.L. (1976) Computers that talk and listen: Man-machine communication by voice, Proceedings of the IEEE, 64, no. 4, April 1976, p. 405-415.
Hart, P.E. (1968) The condensed nearest-neighbour rule, IEEE Trans. IT-14, May 1968, p. 515-516.
Klatt, D.H. (1977) A review of the ARPA Speech Understanding Project, Expanded preprint version of a paper to be published in the Reviews section of the Journal of the Acoustical Society of America.
Willems, L.F. and De Jong, Th.A. (1974) Research tools for speech perception studies, I.P.O. Annual Progress Report, 9, p. 77-81.
The Formator: a speech analysis-synthesis system based on formant extraction from linear prediction coefficients
L.L.M. Vogten and L.F. Willems
Introduction
It is well known in speech that the value of the waveform at a given instant is closely correlated with its values at previous instants, and hence represents redundant information (Flanagan, 1972). Among the many models describing the speech signal more efficiently, the production model based on the linear predictability of the speech wave has been quite successful (e.g. Fant, 1960; Atal and Hanauer, 1971; Sambur, 1975). This Linear Predictive Coding (LPC) of speech represents the waveform in terms of relatively slowly varying parameters which are related to the transfer function of the vocal tract and to the characteristics of the speech source. The LPC analysis is in fact the calculation of an Mth-order digital filter, the coefficients of which are determined by minimising the mean squared error between the actual input sample and an Mth-order linear prediction of the input sample. From these M coefficients the speech wave can be resynthesised as the output of the inverse filter with the same M coefficients, excited by pulses or by noise (Markel and Gray, 1976).
The LPC method has the advantage that only relatively short segments of the speech wave are analysed in the time domain. No Fourier transform is performed and analysis can be rather fast. Unfortunately the M filter coefficients are less suitable for further processing, because small errors in the coefficients can result in large errors or even instability of the inverse filter used for the synthesis.
On the other hand, we know that a description of the speech wave in terms of natural frequencies of the vocal tract, or formants, is a very efficient one. Formants also change relatively slowly with time (Flanagan, 1970). Hence, if we are able to determine the formants from the M filter coefficients, the LPC analysis cuts both ways. The present contribution describes such an analysis-synthesis system based on formant extraction from the linear prediction coefficients. The system determines 5 formants from a 10th-order LPC analysis. This "Formator" has been developed at our institute and provides a powerful tool in phonetic research, because formant (and also pitch) trajectories can be isolated, varied, stylised or quantised, and the effect of these manipulations on the perception of speech can be studied. The "Formator" may also prove useful in application fields such as voice response units, low-bit-rate vocoders, speech recognition, etc.
The analysis part of the "Formator" has been implemented in software on our P9202
computer. The synthesis part is a digital hardware synthesiser. First we give a
general description of the analysis part, followed by details of the LPC analysis,
the formant extraction and the pitch extraction, whereupon the hardware synthesis part is briefly described. The second part of this contribution gives some examples
of the practical use of the system for bit rate reduction and for its use as
"Intonator" (Willems, 1966) in phonetic research.
The 'Formator'
A block diagram of the system is shown in Fig. 1. The original speech is digitised at a 10 kHz sample frequency, 8 bits per sample, and stored on disc with the Speech Editing System (Willems and de Jong, 1974). Then an LPC analysis program is run, yielding the coefficients of a 10th-order digital filter and the amplitude parameter. From these 10 coefficients 5 second-order filters are calculated (each with 2 coefficients). The pitch period and the voiced/unvoiced parameter are determined in a separate program. These 13 parameters are then fed to the digital hardware synthesiser (Rockland 4512) and the remade speech is available. Further details of the system are described in the following sections.
Fig. 1. Block diagram of the "Formator".
The LPC analysis
From the stored speech a 25 msec segment (250 samples) is triangularly windowed and pre-emphasised by a first-order filter 1 - μz^-1 with μ = 0.90. Then the 10 filter coefficients are determined by solving a set of 10 simultaneous equations which results from a least-squares criterion for the error between the actual and the predicted input sample of the speech wave. In fact the autocorrelation method is used (Makhoul, 1975; Markel and Gray, 1976). After some further calculations, which will be described in the next section, the analysis window is shifted over 100 samples to the next 25 msec speech segment. Thus, every 10 msec the amplitude and the 10 filter coefficients are updated. This frame rate of 100 Hz is a suitable value for normal speech; in many cases steps of 20 or 30 msec also give good results.
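For the reader unfamiliar with the autocorrelation method, the analysis of one frame can be sketched as follows (a textbook Levinson-Durbin version in Python, not the institute's program; the test signal is invented):

    import numpy as np

    def lpc_frame(samples, order=10, mu=0.90):
        """10th-order LPC of one 25 ms frame by the autocorrelation method."""
        x = np.asarray(samples, dtype=float)
        x = np.append(x[0], x[1:] - mu * x[:-1])    # pre-emphasis 1 - mu*z^-1
        x = x * np.bartlett(len(x))                 # triangular window
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]

        # Levinson-Durbin recursion for the normal equations.
        a = np.zeros(order)
        err = r[0]
        for i in range(order):
            k = -(r[i + 1] + np.dot(a[:i], r[i:0:-1])) / err
            a[:i + 1] = np.append(a[:i] + k * a[:i][::-1], k)
            err *= (1 - k * k)
        return a, err    # a_k of A(z) = 1 + sum a_k z^-k, and residual energy

    # Invented frame: 250 samples of a decaying 700 Hz resonance, fs = 10 kHz.
    t = np.arange(250) / 10_000
    frame = np.exp(-300 * t) * np.sin(2 * np.pi * 700 * t)
    a, err = lpc_frame(frame)
    print(np.round(a, 3))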
Fig. 2. Time (left panel) and frequency (right panel) representation of a 25 msec speech segment (one frame). The upper curve in each panel concerns the unwindowed signal, the lower curve is the triangularly windowed and pre-emphasised signal. In the middle of the right panel is shown the spectrum that results from the LPC analysis.

An example of the analysis result of one frame is shown in Fig. 2. The 25 msec speech segment, shown at the top of the left panel, is windowed, pre-emphasised and then plotted at the bottom of the left panel. For display purposes the corresponding fast Fourier transforms of the two time signals are plotted in the right panel. This FFT is not used in the LPC calculations. The 10 predictor coefficients, which in fact represent the impulse response of the digital filter, are calculated, and the corresponding FFT spectrum (also calculated for display purposes only) is shown in the middle of the right panel. It illustrates how the spectral envelope of the filter fits that of the (lower) speech wave. For this voiced speech segment (the English vowel /ɔ/ of the word "call") the 5 formants are easily discernible.
Formant extraction from the LPC data
The digital filter determined with the LPC analysis program is characterised by 10 filter coefficients {a_k} and can be represented in the z-domain by

    A(z) = 1 + Σ(k=1..10) a_k·z^-k                  (1)

The polynomial (1) can also be written as a product of 5 quadratic terms:

    A(z) = Π(i=1..5) (1 + p_i·z^-1 + q_i·z^-2)      (2)
Calculation of the coefficients {p_i,q_i} from the coefficients {a_k} can be done numerically. Then we have a set of 5 {p_i,q_i} combinations representing a cascade of 5 digital second-order filters equivalent to the 10th-order filter. These 5 second-order filters can now be conceived as the 5 formants that we are looking for.
However, we are still left with two problems: (a) the pairs {p_i,q_i} resulting from the calculations are not naturally ordered on a frequency scale, while the formants definitely are, and (b) it is possible that, especially for consonants or fricatives, one or more of the pairs {p_i,q_i} represent a filter whose poles are real. In that case we cannot speak of a formant having a tuning frequency and a bandwidth.
These problems are solved by the application of a transformation procedure to the {p_i,q_i} pairs, so that ordering on a frequency scale becomes easy. After that ordering procedure, p and q of each pair are limited to such values that they always correspond to complex pole pairs. We shall not go into further details here but only remark that these changes have no audible effect upon the spectrum of the ultimately resynthesised speech segment. This is illustrated in Figs. 3a, b and c, where examples are shown of spectra in which only 4 peaks are present. In Fig. 3a (the upper curve) the second formant is missing in the spectral envelope, owing to a real p2,q2 pair. If we force this pair to values corresponding to a complex pole pair with a large bandwidth, the lower spectrum in Fig. 3a results. No difference can be discerned between the upper and the lower curves.
Another example is shown in Fig. 3b, where the third formant was missing and the higher formants have a large bandwidth, so only 3 peaks are discernible in the spectrum. Making the p3,q3 pair complex causes the lower curve of Fig. 3b to be somewhat steeper in the region above 4 kHz than the original spectrum (upper curve). Comparable results are shown in Fig. 3c, where the fifth formant was missing.
Fig. 3. The spectrum of the digital filter resulting from the LPC analysis before (upper curves) and after (lower curves) the formant extraction. In panel a the second formant is "missing", in panel b the third and in panel c the fifth formant (upper curves). They are "artificially" added by changing the second-order filter coefficients so that the real poles become complex, yielding a formant with a large bandwidth (lower curves).
These examples illustrate that the errors introduced by the "forced complex making" procedure have little or no effect upon the spectral envelope of the resulting digital filter. The result is that we now always have 5 and only 5 formants, and after the definite assignment of numbers 1 to 5 inclusive, the complex and ordered {p_i,q_i} pairs are used to determine the input parameters for the digital hardware device in order to synthesise the speech wave.
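Leaving aside the ordering transformation and the treatment of real poles, the kernel of the extraction is the factoring of A(z) and the conversion of each complex pole pair into a frequency and a bandwidth. A sketch, using the standard textbook conversion (the report does not state the formulas explicitly):

    import numpy as np

    def formants_from_lpc(a, fs=10_000):
        """Factor A(z) = 1 + sum a_k z^-k and convert each complex pole pair
        of the synthesis filter 1/A(z) into a formant frequency and bandwidth."""
        roots = np.roots(np.concatenate(([1.0], a)))
        roots = roots[np.imag(roots) > 0]         # one of each conjugate pair;
                                                  # real poles would need the
                                                  # "forced complex making" step
        freq = np.angle(roots) * fs / (2 * np.pi)   # pole angle  -> frequency (Hz)
        bw = -np.log(np.abs(roots)) * fs / np.pi    # pole radius -> bandwidth (Hz)
        order = np.argsort(freq)                    # order on a frequency scale
        return freq[order], bw[order]

    # Self-check with an invented A(z) built from two known resonances.
    # Each second-order pair is p = -2r*cos(theta), q = r^2.
    fs, test = 10_000, [(700, 100), (1800, 150)]    # (frequency, bandwidth) in Hz
    poly = np.array([1.0])
    for f, b in test:
        r, theta = np.exp(-np.pi * b / fs), 2 * np.pi * f / fs
        poly = np.polymul(poly, [1.0, -2 * r * np.cos(theta), r * r])
    F, B = formants_from_lpc(poly[1:], fs)
    print(np.round(F), np.round(B))                 # ~[700 1800] [100 150]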
Pitch extraction
The pitch period or fundamental frequency and the voiced/unvoiced decision are determined every 10 msec from a speech segment of 35 msec. This segment is long enough to ensure that at least 2 pitch periods are present in the waveform. For the pitch extraction we use a modified version of Sondhi's (1968) method. First the spectrum of the speech segment is flattened by a dynamic centre-clipping procedure and then the "auto-sign-correlation function" (Rabiner, 1977: method 6) is calculated. One of the maxima of this function is taken as the pitch period, provided it is positioned within a specific interval. Position and width of this interval depend on position and magnitude of the previous maximum. A high magnitude of the previous peak implies a salient pitch period, and in that case the interval within which the new peak has to be found is narrow. A low magnitude, on the other hand, goes with a large possible interval for the new pitch period. Proper choice of the boundaries can avoid pitch doubling or octave jumps in the measured pitch. This method of variable window width not only saves calculation time but also takes into account a certain continuity that is always present in natural pitch contours.
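A much simplified rendering of the extractor is sketched below; the clipping level, the 50-500 Hz bounds and the fixed ±30% continuity interval are our own placeholders for the unspecified details:

    import numpy as np

    def pitch_period(segment, prev_period=None, fs=10_000):
        """Estimate one pitch period (in samples) from a 35 ms segment."""
        x = np.asarray(segment, dtype=float)
        c = 0.3 * np.max(np.abs(x))                           # dynamic clipping level
        s = np.where(x > c, 1.0, np.where(x < -c, -1.0, 0.0))   # centre clipping
        acf = np.correlate(s, s, mode="full")[len(s) - 1:]      # auto-sign-correlation
        lo, hi = int(fs / 500), int(fs / 50)                  # 50-500 Hz search range
        if prev_period is not None:                           # continuity constraint:
            lo = max(lo, int(0.7 * prev_period))              # a narrower interval
            hi = min(hi, int(1.3 * prev_period))              # around the last period
        return lo + int(np.argmax(acf[lo:hi]))

    # Invented voiced segment: 100 Hz pulse train through a simple decay.
    x = np.zeros(350)
    x[::100] = 1.0
    x = np.convolve(x, np.exp(-np.arange(80) / 20.0))[:350]
    print(pitch_period(x, prev_period=95))                    # ~100 samples (100 Hz)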
Hardware synthesis
The speech wave can now be resynthesised by a digital hardware synthesiser. This device consists of a cascade of 5 second-order digital filters excited by a quasi-periodic pulse (voiced sound) or by noise (unvoiced speech). It needs the amplitude, pitch period, voiced/unvoiced and formant parameters at every pitch period. Since the parameters in the analysis are calculated at 10 msec intervals, an interpolation is necessary corresponding to the actual pitch periods. These interpolated parameter values are then used as the input parameters for the synthesiser (Rockland 4512).
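The synthesiser's sections can be written as difference equations; the conversion from formant frequency F and bandwidth B back to the coefficients p and q is the inverse of the extraction step above. Again a schematic sketch, not the Rockland 4512:

    import numpy as np

    def resonator_coeffs(F, B, fs=10_000):
        # Second-order section 1 / (1 + p*z^-1 + q*z^-2):
        # p = -2r*cos(theta), q = r^2, with r = exp(-pi*B/fs), theta = 2*pi*F/fs.
        r = np.exp(-np.pi * B / fs)
        theta = 2 * np.pi * F / fs
        return -2 * r * np.cos(theta), r * r

    def run_section(x, p, q):
        """Difference equation y[n] = x[n] - p*y[n-1] - q*y[n-2]."""
        y = np.zeros_like(x)
        for n in range(len(x)):
            y[n] = x[n] - p * (y[n - 1] if n >= 1 else 0.0) \
                        - q * (y[n - 2] if n >= 2 else 0.0)
        return y

    def synthesise(formants, pitch_period, n, fs=10_000):
        excitation = np.zeros(n)
        excitation[::pitch_period] = 1.0          # quasi-periodic voiced source
        y = excitation
        for F, B in formants:                     # cascade of 5 sections
            y = run_section(y, *resonator_coeffs(F, B, fs))
        return y

    # Invented /a/-like formant set, 100 Hz pitch, 0.2 s of signal:
    formants = [(730, 90), (1090, 110), (2440, 160), (3500, 200), (4500, 250)]
    y = synthesise(formants, pitch_period=100, n=2000)
    print(np.max(np.abs(y)))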
Practical use of the 'Formator'
This section contains a brief description of some possibilities and results of informal experiments with the "Formator". As details of the system are still being improved, "objective" test results are not yet presented.
Up to now a direct comparison between the stored input speech and the resynthesised version has been performed for about 10 different speakers (male and female). Several seconds of normal speech (Dutch and English mainly, different sentences for different speakers) were analysed. The raw analysis results were smoothed with both a running median smoothing over 5 frames and a linear filtering over 3 frames (Rabiner et al., 1975). Examples of the raw and the smoothed data are shown in Figs. 4a and 4b for the sentence "I don't think it's necessary to call in the doctor". Amplitude, pitch, voiced/unvoiced parameter and formant frequencies and bandwidths are shown for a segment of 2 sec, almost the complete sentence. In the experiments the input speech was immediately followed by the two resynthesised versions, from the raw and the smoothed data. Although, of course, slight differences were audible between the original and the raw or smoothed resynthesised speech, the quality of the remade speech was good.
Bit rate reduction with the 'Formator'
Once we have a description of the spectral envelope of the speech wave in terms of ordered {p_i,q_i} pairs related to formants, it is easy to quantise these parameters and hence reduce the bit rate.
Fig. 4. The 13 parameters calculated in the analysis part of the system plotted as a function of time. Upper panel: amplitude, unvoiced marks and pitch contour. Panel (a): raw data from the analysis; the length of the vertical bars is the formant bandwidth (in Hz) divided by 2, so as not to overload the figure. The formant frequency is in the middle of each bar. Short bars indicate a narrow bandwidth and thus a sharp and high peak in the spectrum. Panel (b): smoothed formant data. Now the formant frequencies of the 10 msec frames are interconnected in order to show the formant tracks. Panel (c): the same sentence analysed at 40 msec steps and then quantised with 28 bits per frame, resulting in a bit rate of 700 bits/sec and still acceptable in quality.
In our case the digitised input speech needs 80 kbits/sec (10 kHz sample frequency, 8 bits per sample). The analysed speech can be described with about 14 kbits/sec: pitch 8, amplitude 8, voiced/unvoiced 1, each formant frequency 12 and formant bandwidth 12, making 137 bits per (10 msec) frame. Preliminary experiments with quantisation of the parameters down to 28 bits per frame (amplitude 3, pitch 6, voiced/unvoiced 1, F1 up to F5 with respectively 3, 4, 3, 0, 0 bits and B1 up to B5 with respectively 2, 2, 2, 1 and 1 bits) turned out to have almost no audible effect upon the resynthesised speech compared with the unquantised version. Now we have a bit rate of 2800 bits/sec. Still
further reduction of information content with little loss of quality can be achieved
by stylisation of the formant trajectories with an approximation by straight lines.
Another possibility is to increase the analysis step width from 10 msec to 30 or
40 msec. This not only considerably reduces the frame rate and hence the bit rate
but also the calculation time. In Fig. 4c an example is shown of the same sentence
as in Figs. 4a and b, but now analysed with frame steps of 40 msec and then quantised
with the same number of bits as mentioned above. This resulted in a description of
the speech with 700 bits/sec and still acceptable in quality.
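The quoted bit rates follow directly from these allocations and the frame rate; a small check in Python:

# Bit allocations as given in the text.
full_frame = 8 + 8 + 1 + 5 * (12 + 12)                                 # 137 bits per 10 msec frame
quant_frame = 3 + 6 + 1 + (3 + 4 + 3 + 0 + 0) + (2 + 2 + 2 + 1 + 1)    # 28 bits per frame

print(full_frame * 100)          # 13700 bits/sec, about 14 kbits/sec
print(quant_frame * 100)         # 2800 bits/sec at 10 msec frames
print(quant_frame * 1000 // 40)  # 700 bits/sec at 40 msec frames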
The 'Formator' as 'Intonator'
Another interesting feature of the analysis-synthesis system is the possibility
of stylising the pitch contours. The natural pitch contour, measured in the ana
lysis part of the system, can easily be replaced by a stylised intonation contour
of arbitrary shape. Thus the experimenter immediately obtains an impression as to
which pitch movements are relevant to the overall intonation pattern (Collier and
't Hart, 1975; 't Hart and Cohen, 1973) and which are not, simply by comparing the
speech with the measured pitch contour to an artificial stylised version and
then judging whether they are perceptually equivalent or not.
An example of two perceptually equivalent intonation patterns is given by 't Hart
(1977). One advantage of the present system compared with previous "Intonators"
(Willems, 1966) is the better quality of the remade speech.
Summary
We presented the "Formator", a speech analysis-synthesis system based on a
Linear Prediction Coding of the speech wave followed by a formant extraction pro
cedure. At present the analysis is still performed in software and a 5-formant
analysis with a frame rate of 100 Hz takes about 30 times real time. Pitch is
measured in a separate program, taking about 20 times real time. This "Formator"
looks like becoming a promising system, not only for phonetic research but also
in the field of low-bit-rate vocoders, voice response units and speech recognition.
References
Atal, B.S. and Hanauer, S.L. (1971) Speech analysis and synthesis by linear prediction of the speech wave, J. Acoust. Soc. Am., 50, p. 637-655.
Collier, R. and 't Hart, J. (1971) A grammar of pitch movements in Dutch intonation, I.P.O. Annual Progress Report, 6, p. 17-21.
Fant, G.C.M. (1960) Acoustic theory of speech production, Mouton & Co, 's-Gravenhage, The Netherlands.
Flanagan, J.L. (1970) Synthetic voices for computers, I.E.E.E. Spectrum, October 1970.
Flanagan, J.L. (1972) Speech analysis, synthesis and perception, Springer, Berlin.
Makhoul, J. (1975) Linear prediction: a tutorial review, Proc. I.E.E.E., 63, p. 561-580.
Markel, J.D. and Gray, A.H. (1976) Linear prediction of speech, Springer, Berlin.
't Hart, J. and Cohen, A. (1973) Intonation by rule: a perceptual quest, J. Phonetics, 1, p. 309-327.
't Hart, J. (1977) Pitch contour stylisation on a high-quality analysis-resynthesis system, this issue.
Rabiner, L.R. (1977) On the use of autocorrelation analysis for pitch detection, I.E.E.E. Trans. ASSP-25, p. 24-33.
Rabiner, L.R., Sambur, M.R. and Schmidt, C.E. (1975) Applications of a nonlinear smoothing algorithm to speech processing, I.E.E.E. Trans. ASSP-23, p. 552-557.
Sambur, M.R. (1975) An efficient linear prediction vocoder, Bell Syst. Techn. Journ., 54, p. 1693-1723.
Sondhi, M.M. (1968) New methods of pitch extraction, I.E.E.E. Trans. AU-16, p. 262-266.
Willems, L.F. (1966) The Intonator, I.P.O. Annual Progress Report, 1, p. 123-125.
Willems, L.F. and de Jong, Th.A. (1974) Research tools for speech perception studies, I.P.O. Annual Progress Report, 9, p. 77-81.
Pitch contour stylisation on a high-quality analysis-resynthesis system
J. 't Hart
Introduction
As has been reported in earlier issues of this report, as far back as the very first
one ('t Hart, 1966) intonation studies have fruitfully been based on the possibility
of making stylised, artificial pitch contours as perceptual equivalents to original
Fo courses by means of the Intonator (Willems, 1966). This Channel-Vocoder based
instrument can easily and quickly be manipulated with respect to location, slope and
duration of changes of fundamental frequency. The artificial pitch contour is dis
played on a large-screen oscilloscope. Within two successive revolutions of the
tape loop with the input message, the selection of some preset changes can be altered
leaving desired reference points of the contour unchanged, thus providing a facility
for direct comparison of two contours.
The apparent possibilities for stylisation have been made plausible by providing some
psycho-acoustic backgrounds for them ('t Hart, 1976): the differential sensitivity
to size, location and slope of frequency changes is rather limited. In the same paper,
however, it is admitted that our usually far-reaching stylisations with standard move
ments often go beyond the threshold of sensitivity estimated there.
A main drawback of the Intonator is that its output signal has poor quality. The
question which then arises is whether this could not be a reason for the larger toler
ances than can be explained on psycho-acoustic grounds. The listener might be ham
pered by this poor quality and thus be unable to concentrate adequately on the pitch
phenomena. Moreover, since a number of the psycho-acoustic background experiments
have been done on the basis of Intonator-processed stimulus material, their results
may be affected by the same lack of quality as well.
Recently an LPC-based analysis-resynthesis system has become available, and one of
its applications is, understandably, that of a high-quality Intonator (Vogten and
Willems, this issue). This provides a direct opportunity for examining the possible
influence of poor quality.
Stylisation on the LPC system
This is a report of our first attempt to obtain a maximally stylised contour for a
Dutch sentence by means of the new system, and to compare it with what had appeared
to be possible with the old Intonator. The sentence is: "Ik geloof aan het volmaakte van al het gebeuren" ("I believe that all that happens is perfect", Helene Kröller-Müller's epitaph), as spoken by a professional broadcaster.
Fig. 1 shows the progressive stylisation applied in the earlier experiment with
the Intonator. The dots indicate the outcome of an objective measurement of Fo,
carried out later, together with the stylisation on the new system. The scattered
dots in the final part originate from erroneous measurements due to low amplitude
of the last syllable. In view of our present aim, no attempts have been made to obtain correct measurements for that syllable. The upper solid line (indicated by 1)
is an approximation in which the syllables "-loof", "vol-", "-maak-", "-te", "van",
"al" , "het", "beu-", if made audible in isolation, all sound at the same pitch as in
the respective isolated syllables of the original. In successive stylisations, the
dent in "vol-" (at position 2), the downstep in "-te" (at position 3) and the over
shoot in "al" (at position 4) have been smoothed out. This is not audible when
listening to the contour as a whole. Line 5 represents "intonation by rule": although this is a completely acceptable contour as such, it is clearly distinguishable from both the original and the other stylised contours, mainly because the peaks are experienced as too low; the standard excursions of 4 semitones are smaller than the
actual ones. This can easily be corrected by local adaptation of the excursions,
but almost equally well by raising the entire contour 2 or 3 semitones (not drawn in
the figure).
[Fig. 1: frequency (Hz, 50-200) as a function of time (0-3 sec), with the syllables of "Ik geloof aan het volmaakte van al het gebeuren" marked along the time axis.]
Fig. 1. The various stylised pitch contours as applied in the earlier experiment with the Intonator. Line 5 is "intonation-by-rule"; that contour can easily be distinguished from the original. Dots indicate Fo measurements.
[Fig. 2: frequency (Hz) as a function of time (0-3 sec) for the same sentence.]
Fig. 2. The pitch contours used with the LPC-system. Same sentence, same Fo measurements (dots) as in Fig. 1.
Fig. 2 gives the situation with the new system. Again, the dots indicate the Fo measurement. Line 1 corresponds roughly to line 1 of Fig. 1, except for the downstep in "-te", which is present in line 2, in which the gradual fall after the first peak has also been replaced by a more rapid one. Line 3 can be characterised as "intonation by rule", with locally adapted excursions. Again, the dent, the downstep and the overshoot have been removed.
In direct comparisons of each of the four stylised versions with the original (3 trained listeners), we could only hear differences if we knew beforehand which of the stylised versions was going to be made audible. These differences were: too high a pitch in "aan het" with the gradual fall of line 1, and the absence of a difference in pitch between "al" and "het" with the omission of the overshoot in line 3. The knowledge as to which contour is presented facilitates concentration on those particular portions where differences are expected according to the visual deviations. And yet we were
unable to hear any of the other clearly visible differences.
Conclusion
First of all, we may say in general that the new system provides us with a splendid,
precise and highly reliable Intonator, more versatile than the old one, and with a
much better quality. With respect to the particular problem put forward above, there seems to be no reason to fear that a high-quality analysis-resynthesis system will impose more restrictions on the degree of possible stylisation than a poor-quality system.
References
't Hart, J. (1966) Perceptual analysis of Dutch intonation features, I.P.O. Annual Progress Report, 1, p. 47-51.
't Hart, J. (1976) Psychoacoustic backgrounds of pitch contour stylisation, I.P.O. Annual Progress Report, 11, p. 11-19.
Vogten, L.M. and Willems, L.F. (1977) The Formator: a speech analysis-synthesis system based on formant extraction from linear prediction coefficients, this issue.
Willems, L.F. (1966) The Intonator, I.P.O. Annual Progress Report, 1, p. 123-125.
Auditory feedback as a factor in disrupted speech production
A.F.V. van Katwijk
Introduction
Does the auditory perception of one's own voice interact with the ongoing speech production?
Auditory feedback of speech has been extensively discussed mainly in connection with
stuttering and the effects on speech of delayed auditory feedback (DAF). Cherry
and Sayers (1956) found that stutterers stop stuttering if they are prevented from
hearing their own voices. Fairbanks and Guttman (1958) analysed the articulatory
errors of subjects who heard their own voices with delays of 0, 0.1, 0.2, 0.4 and
0.8 sec. Disturbance was maximal when the delay was 0.2 sec. Prominent disturbances
occurred most frequently in stressed syllables, and concerned lengthenings and
additions. Repetitions i.e. double articulations made up 70% of the additions, and
were considered the most characteristic feature of the disturbances. They were
interpreted in terms of a temporary restitution of the normal feedback relationships.
Our own observations and assumptions lead us to view repetitions rather as the in
voluntary result of recycling of speech material in articulation.
On the general problem of what one does with the auditorily perceived own speech,
Lane and Tranel (1971) argue in a well-documented paper that the hypothesis of a
direct sidetone control loop lacks adequate empirical support. Instead they suggest
that communicative demands largely determine what one does with one's speech. With
respect to stuttering they go on to suggest that "the experimental findings with
stutterers and normals lead to the same conclusion: the less the sidetone monitoring,
the more normal speech is possible." (p. 699).
The question can be raised whether sidetone monitoring has anything to do with
stuttering. Bloodstein (1975, p. 264) discusses the hypothesis of feedback control,
and doubts its relevance in view of the many facts that are not directly plausible
under it. For instance the fact that "stuttering tends to occur at the moments of
initiation of speech units where auditory feedback is absent".
The literature has many other indications that stand in the way of a simple feed
back hypothesis on stuttering.
Observations and considerations
This contribution deals with some observations on "DAF speech" as recorded during
an experiment with 12 subjects. The material furnishes some pertinent details as
to how auditory feedback signals may run through the speech production lines.
As regards these production lines, we must assume, even without DAF, that there is
a time lag of 0.1 sec or more between the execution of an articulatory programme and the acoustic result. With DAF, a delay of say 200 ms implies a temporal lag
between execution and perceived auditory signal of 300 ms or more. Note that
Fairbanks and Guttman (op.cit.) have found that the delay duration is positively correlated with the number of speech sounds involved in repetitions under DAF. This
span can be related to the number of speech sounds that have been produced before
the feedback loop is closed.
In stuttering and DAF disruptions, these considerations imply that there is a time
lag between production and perception at the basis of a limited number of specific
speech disruptions. Stuttering would in this view - as far as feedback information
could at all play a role - include repetitions of single short segments, of which
the durations agree with the presumed internal production span of at least 0.1 sec.
Morton (1968) speculates that non-stuttering is made possible by inactive periods
in the units that produce motor command patterns.
According to this view we may speculate as to what happens if the ongoing speech
production process is "hit" by the delayed sounds it has recently produced. A
first effect might be in the programming level, where the planned motor pattern is
run into by a competing pattern derived from the external feedback loop. The more
the patterns are different the less likely it is that an effective motor command
pattern will emerge. On the other hand, if the internal and external patterns are
similar, they might both be accommodated in the programme and executed one after the
other. This would lead to repetitions.
The material that has been analysed for this discussion was collected by Riet Dekker,
psychology student from Utrecht, who performed the DAF experiments as a part of
her training programme. I have concentrated on repetitions and lengthenings and on
the syllables in which these take place. We had 12 subjects repeat three-syllable
words in a normal feedback condition and in a DAF condition. The stimulus words
had been recorded in synchrony with a click pattern which had 214 ms interclick
intervals. The delay time of the feedback under DAF was also 214 ms, to ensure approximately stable phase relationships between syllable durations and delay in
tervals. Precise durations of stimulus and response words and quantitative analyses
of the experimental variables and their effects will be described in due course
elsewhere. In this account we will mainly look into phonetic details of recorded
performances. Listening to repetitions, elongations, pauses and other deviations
an independent judge (phonetically trained) and myself indicated the location of
these deviations for six subjects. Counting only agreed instances, we found that
there were deviations in 2 out of 336 first syllables, in 11 out of 336 second
syllables, in 70 out of 336 both second and third syllables and in 158 out of 336
third syllables. This may be interpreted to show that auditory feedback does affect
speech production under DAF, but only after at least 300 ms have elapsed of speech
produced under the planned programme.
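A rough timing sketch (our own back-of-the-envelope check, in Python, assuming syllables of about 214 ms and an internal execution-to-sound lag of 100 ms) makes the same point:

# Which syllables can still be reached by the delayed feedback?
syllable = 0.214        # s, approximate syllable (and interclick) duration
daf_delay = 0.214       # s, delay of the fed-back signal
internal_lag = 0.100    # s, minimum lag between programme execution and sound

loop_closes = internal_lag + daf_delay      # about 0.314 s after speech onset
for n in range(1, 4):
    onset = (n - 1) * syllable
    relation = "after" if onset >= loop_closes else "before"
    print(f"syllable {n} starts at {onset*1000:.0f} ms, {relation} the loop closes ({loop_closes*1000:.0f} ms)")
# Only the third syllable (onset about 428 ms) begins after the loop has closed,
# in line with the deviation counts reported above.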
Listening again to all twelve subjects I found the qualities of the disruptions of
the second and third syllables characteristically different and revealing:
- Lengthening occurs most frequently and with the largest increments in third syl
lables. There were also duration increments in the second vowel and the third consonant.
- Repetitions were either whole syllables, or vocalic parts of syllables. Repeated
syllables occurred almost exclusively when the target words had identical or
homorganic consonants and identical vowels (Tables I and II). The 18 target words
with heterorganic consonants (but identical vowels) gave rise only to two ambiguous
production errors: the target word pitiki (Table III) might in the first place
have been perceived as a word with two or three identical syllables.
Vowel repetitions are classified in Tables IV and V. The insertion of a vowel in the
third syllable has not occurred often, and when it did, there does not appear to be
a distributional condition with respect to place or manner in the articulation of
the consonants, which would favour the error.
target    production
papapa    papapapa (3x)
tatata    tatatata
bababa    babababa (?x)
dadada    dadadada
gagaga    gagagaga
mamama    mamamama (4x)
nanana    nananana (2x)

Table I. Production errors. Targets with same consonants, same vowels (8 stimuli and 12 subjects).

target    production
pubumu    pubumum/ud w
tudunu    tudu/nunu
tudunu    tudu/nunu
mupubu    mupububu
mapaba    mapab/aba
mipibi    mipibibi
mabapa    mabapapa (2x)
nutudu    nutududu (2x)
natada    natadada
nitidi    nitididi

Table II. Production errors. Targets with homorganic consonants, same vowels (18 stimuli and 12 subjects).
target    production
pitiki    pikikiki
(same)    pipipipi

Table III. Production errors. Targets with heterorganic consonants, same vowels (18 stimuli and 12 subjects).
Vowel repetitions occurred in the instances given in Table IV and V.
target    production
pabumi    pabumui
dinatu    dinatau

Table IV. Production errors. Targets with homorganic consonants, different vowels (6 stimuli and 12 subjects).

target    production
patuki    patukui
badugi    badugui
digabu    digabau

Table V. Production errors. Targets with heterorganic consonants, different vowels (6 stimuli and 12 subjects).
Discussion of lengthenings
The lengthening of the third syllables is the main effect of DAF on the performances
of our subjects. It occurred in well over 50% of the productions. As regards the
mechanism of lengthening at work here, could it be a recycling of vowel information
from auditory feedback into the ongoing vowel production? This would be impossible
in view of the temporal limits of the events: on the assumption that a DAF delay
of 200 ms implies a real delay between programme execution and perceived signal of
at least 300 ms, a regular vowel should be finished long before its beginning has
become available via the feedback loop. This applies a fortiori to consonants.
It would seem more realistic to interpret lengthenings as the results of a slowing
down process arising from incompatibility of internal and external signals. The
lengthenings often make the impression of uncontrolled continuation, a kind of free
wheeling.
Discussion of repetitions
The occurrence of syllable repetitions was limited to words where the syllables were
similar. Similarity here means: same place of articulation of consonants, same
vowels (Tables I and II).
The distribution of production errors in the tables cannot be taken to imply more
than an indication that place of articulation and sameness of vowels are possible
conditions for the occurrence of syllable repetitions. What seems to happen is that
the feedback of a compatible syllable is accommodated in the production programme
that is being executed, and then inserted in it. The occurrence of vowel perseverations indicates that if the feedback vowel occurs at the proper moment, it is also
included in the ongoing production, and inserted within the planned CV programme.
An obvious next step is to analyse the precise temporal relationships between first
and second occurrences of speech sounds under DAF.
The speech disruptions of real stutterers are for a large part the result of compen
satory and substitutory articulations, which makes it difficult to analyse what part
might derive from auditory feedback. If auditory feedback plays a role at all in
stuttering, the time interval between programme execution and perception would be
much smaller than in DAF, so that other types of production errors should be ex
pected.
As to the precise role of auditory feedback, Fairbanks and Guttman (op. cit.), pointing
to the fact that DAF speech has mainly single repetitions, suggest that a single re
petition would temporarily restore the normal feedback conditions. Our limited data
seem to indicate however that the conditions for the occurrence of a repetition are
very restricted, and that these conditions as a rule have disappeared as soon as a
repetition has been produced, which makes a second repetition highly unlikely.
Summary
The analysis of production errors under DAF shows that the delayed auditory signal
may introduce extraneous commands to the articulators that are at that time in the
act of executing a planned speech programme. The associated errors are repetitions
of syllables or single vowels, where the repetition of syllables appears to occur
only if the delayed signal and the programmed syllable have coinciding similarities.
The more general disturbing effect of DAF - lengthening - seems to derive from coinciding dissimilarities between programme and auditory signal.
References
Bloodstein, O. (1975) A handbook on stuttering, revised edition, Nat. Easter Seal Soc. for Crippled Children and Adults.
Cherry, C. and Sayers, B.McA. (1956) Experiments upon the total inhibition of stammering by external control, and some clinical results, J. Psychosom. Res., 1, p. 233-246.
Fairbanks, G. and Guttman, N. (1958) Effects of delayed auditory feedback upon articulation, J.S.H.R., 1, p. 12-22.
Lane, H. and Tranel, B. (1971) The Lombard sign and the role of hearing in speech, J.S.H.R., 14, p. 677-709.
Morton, J. (1968) Considerations of grammar and computation in language behavior, Studies of language and language behavior, Ed. J.C. Catford, VI, p. 499-545.
Vowel length and the perception of prosodic boundaries
D. Bouwhuis and J. de Rooij
Introduction
In the context of a research project set up to investigate the relative contributions
of temporal structures and pitch contours to the perception of prosodic boundaries,
perceptual measurements, designed to establish the contribution of vowel length
have been carried out. In an ambiguous utterance with 3 potential prosodic bounda
ries, durations of the 3 relevant vowels have been varied, together and separately.
Results of the perceptual measurements will be described. The contribution of each individual vowel to the combined effect of the 3 vowels together will be assessed in a quantitative description.
Perceptual measurements
The ambiguous Dutch word string "Daan zei de baas is te laat" (English equivalent
"Dan said the boss is late") can, depending on the location of prosodic boundaries,
be perceived as meaning either that according to the boss Dan is late (reading I:
Daan, zei de baas, is te laat) or that according to Dan the boss is late (reading II:
Daan zei, de baas is te laat).
A spoken version of reading I was processed by means of a computer-controlled channel
vocoder; thus we were able to replace pitch fluctuations by a slowly declining pitch
and to vary the durations of the vowels in "Daan", "zei" and "baas", respectively.
The combined contribution of these 3 vowels was investigated in the following way.
Seven versions of the utterance were constructed; the 1st and 7th versions, with
respect to the durations of the 3 vowels had the durational organisation of spoken
versions of readings I and II, respectively. In the 1st version the 1st vowel was
252 ms, the 2nd 130 ms and the 3rd 223 ms; in the 7th version durations were res
pectively, 162 ms, 190 ms and 163 ms. The other versions had in-between values:
the 1st vowel decreased in steps of 15 ms, the 2nd increased in steps of 10 ms and
the 3rd decreased in steps of 10 ms from the 1st to the 7th version. All other seg
ment durations were equal in all versions.
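The seven duration patterns follow directly from the step sizes given above; written out in Python (values in ms):

# Vowel durations (ms) in "Daan", "zei" and "baas" for the 7 stimulus versions.
versions = [(252 - 15 * k, 130 + 10 * k, 223 - 10 * k) for k in range(7)]
for n, (daan, zei, baas) in enumerate(versions, start=1):
    print(f"version {n}: Daan {daan}, zei {zei}, baas {baas}")
# version 1 gives (252, 130, 223) and version 7 gives (162, 190, 163), as in the text.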
Stimulus sentences thus obtained, were presented in random order, 10 times each, to
10 listeners for the identification of the reading. From this identification the
perception of the concomitant prosodic boundaries has been inferred. Results of 3
different presentations are given in fig. 1. They show that changing the durations
of only the 3 vowels reliably alters prosodic boundary perception.
At the same time the relative contribution of each of the 3 vowels to prosodic
boundary perception was investigated. This was done by varying only one of the
vowel durations at a time and keeping the other two at the values of stimulus number
4. These values had been found (in an experiment not described here) to correspond
to the estimated 50% cross-over point between both sentence readings.
It may be seen that the 1st and 2nd vowels are equally effective in determining
boundary perception but less so than the combined 3 vowels. It can also be seen
that the 3rd vowel has no effect at all.
[Figs. 1 and 2: proportion of reading I responses (%) plotted against stimulus number (1-7); legend: 1st vowel, 2nd vowel, 3rd vowel.]
Fig. 1. Proportion of reading I responses as a function of vowel durations in the italicised words of the ambiguous word string "Daan zei de baas is te laat", in 3 different presentations.
Fig. 2. As in Fig. 1. However, now only one of the vowel durations has been varied at a time, the other 2 having the values of stimulus number 4.
A quantitative analysis
In this analysis we want to assess whether the 1st and 2nd vowels contribute indepen
dently of each other to the combined effect found in the 1st experiment. The whole
utterance might be conceived of as a Gestalt-like pattern in which the effect of all
relevant vowel durations together is superior, or even incomparable to the effect
of the separate vowels. At the other extreme the combined effects of single vowel
durations might fully account for the results obtained by the simultaneous duration
variations. In between the separate effects may possibly reinforce each other in a
kind of interaction which increases their effectivity. Of all three, the second
condition would seem attractive from a theoretical standpoint, since any type of
dependence must be specified. Furthermore, the paucity of the present data does not
allow a large number of assumptions to be made. Therefore the independent processing
model will be explored here, and it will actually be found that more complicated
models cannot be tested on the results.
There are then two main applicable models, the Choice model (Luce, 1959) and the dis
tance model following the reasoning of Thurstonian scaling (Bock and Jones, 1968).
The nature of each of these models will be briefly discussed and a data fit presented.
The presented analysis is the description of a first stage of data analysis. New results, based on another estimation procedure and becoming available only after the time of closing for press, are in line with the published ones and will be presented elsewhere.
The choice model
This model describes the perception of one reading as the result of choosing between
2 response strengths, 1 for each of the 2 possible readings of the presented ut
terance. Each presentation of the utterance in some physical form is supposed to
change the response strength of a given answer. In unambiguous utterances this
would lead to an increase of the corresponding response strength, compared to which
the strengths of all other possible responses would be negligible. For the present
utterance 2 responses are feasible, and what the choice model (and the distance model) describe is the probability with which one or the other has been chosen. Without auditory information the probability of reading I might be expressed as P(I) = 1/(1 + β). The response strength of reading I is taken here to be equal to 1, β being the response strength of reading II. Any auditory information relevant to one of the readings would tend to change these parameters. In the Choice model this occurs in a multiplicative way. Upon hearing the varied length of the vowel in /da:n/ the response strength for reading I would be δ_i, i indicating the particular length, δ being the product of the original response strength 1 and the effect of variation of the vowel length. In this condition the probability of choosing reading I would be

P(I | (Daan)_i) = δ_i / (δ_i + β).
For all 7 vowel variations a choice probability has been obtained, by which 7 of the response strengths could be estimated. This, however, precludes an estimation of the a priori response strength of reading II, the value of β. Looking at the variation data of the 3rd vowel /ba:s/ it seems that its length contributes little or nothing to the choice of one particular reading. The slight scatter could also be wholly due to stochastic fluctuations. For the present purpose the effect of vowel length in /ba:s/ will be assumed as fixed. The factor β can then be estimated from the average choice probability in situations where only the length of the vowel in 'baas' was varied, making it more reliable than the factors δ. The response strength factor β is thus found to be about 1.3, slightly higher than that of reading I. Obviously a slight criterion shift took place in the choice process with respect to the earlier experiment where 'neutral' vowel lengths had been chosen to yield a choice probability of 50%.
Hence, by obtaining the value of β, all other values for the response strengths can be easily estimated from the corresponding choice probabilities. In this way 7 estimates of the response strength δ_i for 'Daan' are obtained, as well as the factors ζ_i denoting the response strengths for 'zei'.
The effects of the two vowel lengths, δ and ζ, must also combine multiplicatively. Thus for a particular combination of vowel lengths denoted by i, the probability of choosing reading I is

P(I | (Daan, zei, baas)_i) = δ_i ζ_i / (δ_i ζ_i + β),

realising that the effect of 'baas' is negligible. The effects of length variations of the first two vowels are completely contained in δ_i and ζ_i, the factor β remaining constant.
These predicted probabilities should then be the same as those found in the experiment where all three vowel lengths were varied. The fit of these predicted probabilities is shown in fig. 3 as the uninterrupted line. There are some minor non-systematic deviations but the overall fit is quite good. The proportion of explained variance is .986, which is about maximum considering the sampling variance of the present data. In fig. 3 are also shown the assumed probabilities for 'baas', which were fixed as a consequence of the estimation procedure. Taking these data into account too, the proportion of explained variance decreases to .969, which is still quite acceptable. This prediction therefore takes into account both probabilities for simultaneous vowel variations as well as those for variations of only /ba:s/. So parameters for these latter variations have been adapted, not really estimated.

[Fig. 3: proportion of reading I responses (%) as a function of stimulus number (1-7); observed data together with the predictions of the choice model and the distance model.]
Fig. 3. Observed and predicted reading I responses as a function of stimulus number. Circles refer to responses averaged over the 3 presentations of Fig. 1. Boxes refer to the 3rd vowel variation only of Fig. 2.

The distance model
Here the effect owing to vowel length is assumed to result in an internal representation strength which is a stochastic variable. It may take on several values for the same input at different times and its distribution is assumed to be normal. The listener has an internal criterion such that when the representation strength exceeds it the response reading I will follow, otherwise reading II. When the utterance containing given vowel lengths has been presented a number of times the criterion of the listener will sometimes have been exceeded and sometimes not, depending on the exact vowel lengths. The relative frequency of occurrences of reading I responses is an estimate of the probability that the internal representation strength had a value as high as or even higher than the criterion. Hence this probability may be seen as the area under the normal distribution curve to the left of the criterion, where the mean of the distribution corresponds to the average of the representation strength produced by the vowel lengths. This average reflects the relative strength on the response axis produced by the vowel lengths, each vowel length leading to a different value. In such a model the increases in representational strength are thought to be additive and consequently effects caused by two vowels must be added on the strength dimension.
For estimation of these strengths the effects of vowel length in 'baas' will again be taken to be fixed at .44, this value being denoted as b. The effect of vowel length in 'Daan' can then be designated as d_i - b, the corresponding effect of 'zei' by z_i - b. For any prediction of the effect of both vowels, the influence of 'baas' and the rest of the utterance is necessarily present, so that the prediction equation is:

s(I | (Daan, zei, baas)_i) = d_i - b + z_i - b + b

Here s is the predicted representation strength from which the response probability can be immediately deduced.
The predictions of this model are shown in fig. 3 along with those of the other model. It can be seen that in the middle part of the curve the predicted values of both models coincide. Only the predictions of the end points are somewhat too extreme compared with the experimental values. The proportion of explained variance is almost the same: .985; together with the assumed fixed values for the effect of 'baas' it amounts to .967.
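To make the two independence models concrete, the following sketch (a Python illustration of our own; the single-vowel proportions used here are invented, not the experimental data, and the mapping of proportions onto strengths is our reading of the estimation step) computes combined predictions for both models:

from statistics import NormalDist

norm = NormalDist()     # standard normal used by the distance model
beta = 1.3              # a priori response strength of reading II (choice model)
b = 0.44                # fixed representation strength of 'baas' (distance model)

# Invented proportions of reading I responses when only one vowel is varied.
p_daan = [0.85, 0.78, 0.68, 0.55, 0.42, 0.33, 0.25]
p_zei  = [0.83, 0.75, 0.66, 0.52, 0.40, 0.30, 0.22]

# Choice model: p = delta/(delta + beta), so delta = beta * p/(1 - p);
# combined prediction: P = delta*zeta / (delta*zeta + beta).
delta = [beta * p / (1 - p) for p in p_daan]
zeta  = [beta * p / (1 - p) for p in p_zei]
p_choice = [d * z / (d * z + beta) for d, z in zip(delta, zeta)]

# Distance model: p = Phi(strength); strengths combine additively:
# s = (d_i - b) + (z_i - b) + b.
d_i = [norm.inv_cdf(p) for p in p_daan]
z_i = [norm.inv_cdf(p) for p in p_zei]
p_dist = [norm.cdf((d - b) + (z - b) + b) for d, z in zip(d_i, z_i)]

for k, (pc, pd) in enumerate(zip(p_choice, p_dist), start=1):
    print(f"stimulus {k}: choice {pc:.3f}   distance {pd:.3f}")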
Discussion
It is not unexpected that both types of predictions are so closely related since the function of the choice model, often called the logistic, is quite similar to
the cumulative normal distribution and is sometimes used as an alternative. Except
for slight differences at the extremes the correspondence here also seems satis
fying. The good fit of both models suggests that the effects of different vowels
are independent of each other. Within the framework of the models the contribution to the final response of one vowel length is not conditional upon that of another vowel length. Such a phenomenon supports the view that the operations involved in utterance interpretation are basically simple.
On the other hand the estimation procedure employed here does not allow the existence
of other parameters possibly defining dependence between vowels. The number of ob
tained values completely covers the number of parameters for effective vowel length
except for those in 'baas'. The present data can therefore not offer any information
on higher-order models. However, this seems unnecessary in view of the good corres
pondence found in the application of the independence models. Finally, the distinction
between multiplicativity and additivity as employed in this context possibly deserves some additional comment.
We note that in the choice model the parameter δ may also be written as e^D, where e is the base of the natural logarithm and D represents the strength owing to the vowel length. Consequently the product δζ may be written as e^D · e^Z = e^(D+Z), where additivity applies in the exponent. In the cumulative normal distribution the nature of the d- and z-parameters is somewhat analogous, though an analytical expression for it does not exist.
The most important conclusion reached from application of both models is that, in speech, different units contribute independently to the interpretation of an utter
ance. In the present experimental data, independence can only be established by
means of a model since no direct measures are available.
The productive value of the independence is that the operations in speech perception
are simple in structure, which is attractive for further experimentation.
Summary
In a perceptual experiment, vowel length variations in an utterance could determine
prosodic boundary perception. Boundary perception was inferred from the interpre
tation of the utterance. There were 2 conditions: in one the length of only a single vowel was varied systematically; in the other the lengths of all 3 vowels concerned were varied simultaneously.
From a quantitative analysis it was found within the chosen framework, that the
separate vowel lengths contributed independently of each other to the perception of
boundaries in speech.
References
Bock, R.D. and Jones, L.V. (1968) The measurement and prediction of judgment and choice, Holden-Day, San Francisco.
Luce, R.D. (1959) Individual choice behavior, J. Wiley & Sons, London.
The perception of English intonation by Dutch and English listeners
R. Collier
Introduction
The present report concerns the initial part of a research program on the comparison
of Dutch and English intonation. Our intention is to subject English intonation to
the same perceptual analyses as have led to an advanced knowledge of Dutch intonation
('t Hart and Cohen, 1973; 't Hart and Collier, 1975).
The problem
In the study of the perceptual aspects of Dutch intonation we have, among other
things, encountered the following problem. From the acoustical point of view unlimited
variability is noted in the course of the fundamental frequency, but some of these
physical variations are not perceived in the perceptual analysis. In the acoustical
resynthesis of an utterance one can therefore smooth the course of the fundamental
without affecting the perceptual equivalence between the original utterance and its
highly stylised copy. However, the degree of perceptual equivalence between an
original and its copy is differently evaluated, depending on the mode of listening.
One can hear differences by listening "analytically", but when listening in a "broad"
fashion, as in a normal communicative situation, one's threshold becomes markedly
higher. In the latter situation various analytical differences between pitch contours
(whether caused by stylisation or not) are judged to be of secondary importance and
are considered to be possible variations in the realisation of the same intonational
category or "pattern".
Several experiments have explored the extent to which Dutch listeners can map dif
ferent pitch contours onto the intonation patterns of their language (Collier and
't Hart, 1972; 't Hart and Collier, 1975; Collier, 1975). In these experiments sub
jects were presented with up to 20 different pitch contours which they were asked to
group according to a criterion of melodic resemblance.
The same kind of experiment has been repeated with English-speaking subjects who, in
their turn, had to group 20 pitch contours of their native language. The same ex
periment was also run with Dutch-speaking subjects who were presented with utterances
in a foreign language that they were familiar with, but only to a limited extent.
The following hypotheses were formulated: (1) language users are capable of grouping
20 different pitch contours into a fairly limited number of intonational categories;
(2) the categorisation by the English speakers corresponds to the basic intonation
patterns as described in manuals on English intonation; (3) the categorisation by
the Dutch speakers differs from that made by the English speakers.
The experiment
Halliday (1970) has described in some detail the melodic characteristics of English.
He distinguishes 7 basic intonation patterns, called "Tones", and indicates for
each a number of variants in their realisation. His description is accompanied by a
set of tape recordings containing examples of the various Tones. From among these illustrations 20 utterances were selected which represent the 7 Tones in at least 2 and at most 5 variant realisations.
[Fig. 1: stylised drawings of the fundamental frequency courses of the 20 stimuli, numbered 1-20, grouped as Tone 1: 1-5; Tone 2: 6-9; Tone 3: 10, 11; Tone 1+3: 12, 13; Tone 4: 14-16; Tone 5: 17, 18; Tone 5+3: 19, 20.]
Fig. 1. The fundamental frequency of the 20 stimuli and their categorisation in "Tones" according to Halliday.
Fig. 1 presents the fundamental frequency of each utterance. The categorisation
proposed by Halliday is the following: utterances 1 to 5 = Tone 1, utterances 6 to 9 = Tone 2, utterances 10 and 11 = Tone 3, utterances 12 and 13 = Tone 1+3 (a combination of Tone 1 followed by Tone 3), utterances 14 to 16 = Tone 4, utterances
17 and 18 = Tone 5, and utterances 19 and 20 = Tone 5+3.
The 20 utterances were recorded on "Language Master" cards (Bell-Howell). In this
way the subject has immediate random access to the utterances, which he can compare
at will in any order. The first group of subjects consisted of 13 English speakers,
the second of 14 Dutch speakers. They were requested to sort out the utterances
according to the criterion of melodic resemblance. The number of groups into which
to divide the set of utterances was left to their own judgment.
Results
Counting the number of times each utterance has been grouped with each of the other
utterances, we obtain a score indicative of the degree of melodic similarity among
the individual pitch contours. Melodically very similar utterances form a coherent
pair or group. In such a group a member may also show a (weaker) relationship to a
member of another group. The groups are
thus not neatly separated but stand per
haps in a hierarchical relationship to
one another. This hierarchy can be com
puted on the basis of the scores of the
individual utterances, using an algo
rithm designed by Johnson (1967). Fig. 2
presents the results of the "maximum"
method of the hierarchical clustering
analysis. If two or more utterances join
at a high level in that figure, it means
that their melodic similarity has been
assessed as strong by the subjects.
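As a sketch of this clustering step (in Python; the co-grouping counts below are invented for illustration), Johnson's "maximum" method amounts to complete-linkage agglomeration on the similarity scores:

# Complete-linkage ("maximum" method) clustering on co-grouping counts.
# sim holds, for each pair of utterances, the number of subjects who grouped them together.
sim = {
    frozenset({1, 2}): 10, frozenset({1, 3}): 9, frozenset({2, 3}): 9,
    frozenset({1, 6}): 2,  frozenset({2, 6}): 1, frozenset({3, 6}): 1,
    frozenset({6, 7}): 11, frozenset({1, 7}): 2, frozenset({2, 7}): 1,
    frozenset({3, 7}): 2,
}
items = [1, 2, 3, 6, 7]

def cluster_sim(a, b):
    # Similarity between two clusters = minimum pairwise similarity of their members.
    return min(sim.get(frozenset({i, j}), 0) for i in a for j in b)

clusters = [frozenset({i}) for i in items]
while len(clusters) > 1:
    # Merge the pair of clusters whose (minimum) similarity is largest.
    a, b = max(((a, b) for a in clusters for b in clusters if a != b),
               key=lambda pair: cluster_sim(*pair))
    level = cluster_sim(a, b)
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    print(f"merge {sorted(a)} + {sorted(b)} at level {level}")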
Fig. 2. Results of the "maximum" method of the hierarchical cluster analysis, according to Johnson. A: English subjects, B: Dutch subjects.
English-speaking subjects (see Fig.2A)
One can see that the groupings between
levels 10 and 6 involve virtually only
pairs of utterances. The members of one
pair belong to the same Tone, with the
exception of (14, 19). Only representatives of Tones 1, 2 and 4 are grouped;
the variants of Tones 3, 5, 1+3 and 5+3
never constitute a cluster, not even at
a relatively low level. At level 5 all
the variants of Tone 1 are grouped:
(1, 2, 3, 4, 5). Tone 1 is the only one
whose variants are all grouped at some
level. At level 5 group (12, 15, 16) is
also formed, but this group is composed of representatives of both Tones 1+3 and 4.
It is striking that the variants of Tone 1 only cluster at level 5, whereas at level
6 group (13, 14, 19, 20) emerges which mixes Tones 1+3, 4 and 5+3.
An explanation why pairs like (6, 7) or (15, 16) are grouped at a high level can
not be based solely on the degree of physical resemblance as shown in Fig. 1. In
fact, that resemblance is no greater than, for instance, that between contours 12
and 13, which are not taken together. On the other hand, there is little physical
similarity between contour 12 and the pair of contours (15, 16) that do form a
group at level 5.
Dutch-speaking subjects (see Fig. 2B)
The formation of pairs by the Dutch subjects corresponds fairly well to that by
the English listeners, but the groupings are made at a higher level.
The pairs involve contours that belong to Tones 1, 2, 4 and 5. The grouping of
more than 2 contours also takes place at a higher level compared with the pre
ceding results. Here, too, the subjects feel that all the variants of Tone 1
should be grouped. Two other groups are distinguished, viz. (6, 7, 10) and
(8, 9, 12, 13). These do not completely correspond to Halliday's classification,
but from the physical point of view these groupings are more plausible than the
heterogeneous mixtures produced by the English listeners.
Some considerations
It is worthy of note that the 2 groups of subjects are in accord on 2 points: (1)
they put all the variants of Tone 1 in one category: (2) they divide the 4 variants
of Tone 2 into 2 distinct groups, viz. (6, 7) and (8, 9). In view of the physical
differences between the 2 types of variants of Tone 2, this subdivision seems
justified and casts some doubt on Halliday's classification. The performance of
the Dutch subjects further suggests, that the distinction Halliday makes between
the variants of Tone 1+3 (contours 12, 13) on the one hand and some of the re
presentatives of Tone 2 (contours 8, 9) on the other hand, is open to doubt.
The classification of the Tones proposed by Halliday pretends to be "melodic".
Nevertheless the impression is that Halliday has allowed "functional" criteria to
come into play. In fact, the utterances that illustrate the use of Tone 2 are all
questions, while those that are examples of Tone 1+3 are all assertions. This
might explain why Halliday - perhaps unconsciously - makes a distinction between
(8, 9) and (12, 13), in spite of their melodic resemblance, whereas he groups
(6, 7, 8, 9) together, even though they represent 2 melodically different sub
groups.
The Dutch-speaking subjects make a classification that is more plausible from
the acoustical point of view. Confronted with a foreign language they concentrate
more easily on the purely melodic aspects of the utterances, while the English
subjects (and even Halliday to a certain degree) are diverted by the attention
they pay to the interpretative correspondences among the pitch contours.
Finally, in evaluating the subjects' performances we should bear in mind that in
this pilot experiment they were expected to distinguish not less than 7 hypothe
tically different intonation patterns. This may have been too difficult a task.
In previous experiments on Dutch intonation, not more than 3 hypothetical into
nation categories were included in the stimulus materials. This led to fewer
complaints and neater results.
Summary
Both English and Dutch-speaking subjects appear to be able to divide 20 different
English pitch contours into a smaller number of intonational categories. Their
performance differs to a certain extent as a function of their different linguistic
backgrounds. The categorisation proposed by both groups of subjects differs in
important respects from the classification of Halliday.
References
Collier, R. (1975) Perceptual and linguistic tolerance in intonation, IRAL, 13, p. 293-308.
Collier, R. and 't Hart, J. (1972) Perceptual experiments on Dutch intonation, Proc. 7th Int. Congr. Phon. Sci., Mouton, Den Haag, p. 880-884.
Halliday, M.A.K. (1970) A course in spoken English: Intonation, Oxford University Press, London.
't Hart, J. and Cohen, A. (1973) Intonation by rule: a perceptual quest, Journ. of Phon., 1, p. 309-327.
't Hart, J. and Collier, R. (1975) Integrating different levels of intonation analysis, Journ. of Phon., 3, p. 235-255.
Johnson, S.C. (1967) Hierarchical clustering schemes, Psychometrika, 32, p. 241-254.
The IPO speech squeezing system
S.M. Marcus
Introduction
A previous report has described a program for visual and auditory analysis of digitised
speech (Marcus, 1976). Some extensions to this program, OVID, will be described here, together with its place within a flexible system for high-quality selective expansion and compression of natural speech. Some research applications will be outlined, together
with a new observation on the effect of segmental duration on pragmatic aspects of per
ceived voice quality.
OVID
OVID (Output and Visual Display) is a program for visual and auditory inspection of a
digitised speech file. It has been described in last year's annual progress report
(Marcus, 1976), since when two important extensions have been made.
First, in addition to the main display window, of from 10 to 200 msec of the digitised
amplitude-time waveform, a second upper display has been added showing one second
of the waveform and the position of the lower window within it. Because of the limited
speed of the display hardware, the upper display shows only every twentieth sample
from the 10,000 in the 1-second section of waveform (sampling is at 10 kHz). Despite
the poor resolution (due to aliasing beats between this effective 500 Hz sampling
frequency and the 5 kHz bandwidth speech signal) the upper display greatly facilitates
program operation. Figures 1, 2 and 3 illustrate typical OVID displays. The displayed
digits indicate the number of samples from the beginning of the window to the vertical
bar cursor, and the amplitude of the indicated sample. Thus the single period of vocal
cord vibration marked by the cursor in Fig. 1 is 80 samples long; at the sampling fre
quency of 10 kHz this is 8 msec and corresponds to a frequency of 125 Hz.
A second facility essential to the implementation of the system which will be des
cribed in the next section is the ability to assign labels to points in the wave
form. These labels are any two characters from the teletype keyboard. Their nature
and location can be assigned arbitrarily, but they are used here to give a crude
phonemic coding to the digitised waveform. Such events as vowel onsets are not al
ways easy to determine, but the visual and auditory inspection facilities of the
program are a powerful aid to the ear of the user.
'Speech squeezing'
A number of systems at the institute have used channel vocoders for speech pro
cessing. One of these is the "Ritmator", which allows changes in segment duration
to be made to speech (Willems and de Jong, 1974). Unfortunately, the quality of
speech processed in this way can be described as intelligible rather than natural.
Although more recent work with LPC-vocoders (Vogten, this issue) has resulted in
enormous improvement, it was felt desirable for experiments on speech timing to keep
as close as possible to the original speech quality, and therefore to modify the
original (digitised) amplitude-time waveform as little as possible.
[Figs. 1-3: OVID amplitude-time displays at window lengths ranging from 1 sec down to 12 msec; the cursor read-out shows x = 80 samples, y = 102.]
Fig. 1, 2 and 3. Examples of OVID displays illustrating the range of time scales available.
Given the location of comparable points in each pitch period, speech may be compressed
or expanded by deletion or duplication of whole periods. Providing the start and endpoints are carefully chosen to minimise discontinuities, the resulting speech is comparable in quality to the original digitised natural speech. Fig. 4 illustrates how
this process might be used to compress speech by a factor of 2/3 by omitting one pitch period in three.
[Fig. 4: the original amplitude-time waveform (top) and the compressed waveform (x 2/3, bottom); scale bar 10 msec.]
Fig. 4. Pitch synchronous compression of a section of speech by a factor of two thirds.
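A minimal sketch of this pitch-synchronous editing, in Python (assuming the period boundaries have already been marked with OVID; the waveform and boundaries below are artificial):

import math

def compress_two_thirds(samples, boundaries):
    # 'boundaries' are sample indices of pitch-period onsets; the last entry
    # closes the final period.  Every third period is omitted.
    periods = [samples[boundaries[i]:boundaries[i + 1]]
               for i in range(len(boundaries) - 1)]
    kept = [p for k, p in enumerate(periods) if k % 3 != 2]
    return [s for p in kept for s in p]

# Six artificial 80-sample periods (8 msec each at a 10 kHz sampling rate).
period = [int(1000 * math.sin(2 * math.pi * n / 80)) for n in range(80)]
samples = period * 6
boundaries = [80 * i for i in range(7)]
print(len(samples), len(compress_two_thirds(samples, boundaries)))   # 480 -> 320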
The output of OVID gives the location and duration of pitch periods, and of similar
duration segments of unvoiced speech and silent intervals which can be compressed in
the same way. The user also has to indicate segments which may under no circumstances
be duplicated or deleted, such as stop bursts or periods involving large amplitude
changes. Segment classification is performed manually by the user, the different seg
ment types being indicated by means of a set of pushbuttons (see Marcus, 1976), and
takes something in the order of 30-60 times real time, depending on the pitch of the
speaker's voice.
Since only whole segments may be duplicated or deleted, the problem of changing the
duration of a section of speech by a desired amount requires an optimum choice out of
the possible segments.
The requirement of attaining a desired duration may conflict with the desirability of
distributing changes as evenly as possible throughout the original waveform. The so
lution adopted is to use a random number generator to select segments and then impose
the requirement that, if that segment is changed, the new section duration must be
closer to its desired value (absolute) than before the change. This continues until
a criterion of proximity to the target duration is reached.
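The selection procedure itself can be sketched as follows (a simplified Python version that only deletes segments; duplication for lengthening works analogously, and all names and numbers here are illustrative):

import random

def squeeze_to_target(durations, deletable, target, tolerance=5.0, rng=random):
    # Randomly pick deletable segments; accept a deletion only if it brings the
    # total section duration closer to the target, until within the tolerance.
    durations = list(durations)
    total = sum(durations)
    while abs(total - target) > tolerance:
        candidates = [i for i in deletable if durations[i] > 0]
        if not candidates:
            break
        i = rng.choice(candidates)
        new_total = total - durations[i]
        if abs(new_total - target) < abs(total - target):
            total = new_total
            durations[i] = 0            # segment deleted
    return durations, total

# Ten 8-ms pitch periods; the first two (e.g. a stop burst) must not be touched.
durs = [8.0] * 10
print(squeeze_to_target(durs, deletable=range(2, 10), target=56.0))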
In order to make selective changes in segment duration, the labelling facility im
plemented in OVID is used to indicate those sections which should be modified. An im
portant consideration was that in many cases a number of different changes might be
attempted on the same sets of segments, which might not lie in a sequence. A simple
experimenter-oriented control language was therefore devised which first allows a
"structure" to be assigned to an utterance, and tlilen permits changes to be made to
elements of this structure. Fig. 5 illustrates an interaction with the program.
First the name of an OVID data file containing both segmental (periodic) and label
information is given. The program indicates that this is information over speech file
SQt001 and then types the labels, the two characters one above another. The user
must then type one teletype character under each label. Each teletype character
defines a group on which operations will be carried out separately. In the example,
all the consonant-like sections have been assigned to group C and all the vowel-like
to group V. Possible operations are to change each section by a specified ratio, by
a specified duration, or to a fixed duration (which may be zero). When all operations
have been given, the command GO results in a new speech file being constructed and
output. With the current hardware this takes about 3 times real time. An optional
print-out gives each label, the group to which it has been assigned, its onset time
in the new version (in seconds), its original duration, its new duration and the
error between this obtained value and that required by the operation on sections in
that group (see Fig. 5.).
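By way of illustration, the effect of such group operations on labelled segment durations can be sketched as follows (in Python; the durations are taken loosely from the Fig. 5 print-out, and the real program works on whole pitch periods, so its new durations differ slightly):

# Apply per-group operations (as in the control language above) to labelled segments.
segments = [("H", "C", 0.034), ("EE", "V", 0.059), ("L", "C", 0.056), ("AA", "V", 0.231)]
operations = {"C": lambda d: d * 3 / 4,     # consonant-like sections to 3/4
              "V": lambda d: d * 4 / 3}     # vowel-like sections extended by one third

onset = 0.0
for label, group, old in segments:
    new = operations.get(group, lambda d: d)(old)
    print(f"{label:3s} {group}  onset {onset:.3f}  old {old:.3f}  new {new:.3f}")
    onset += new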
Applications
The system provides a powerful tool for experiments with virtually optimum quality
changes in speech rate. Unlike most systems which involve direct modification of the
digitised speech waveform, the two-stage process of analysis and "squeezing" coupled
with the group structure of the experimenter-oriented control language gives great
flexibili ty.
A simple application is the simulation of hardware speech compressors of varying de
grees of sophistication - those involving deletion of fixed duration segments, seg
ments lying between two zero crossings and true-pitch synchronous compressors - and
assessing the trade-off between the cost of such sophistication and intelligibility
and quality.
Monitoring tasks are providing valuable insight into the contribution of many acoustic,
phonetic, linguistic and extralinguistic factors in the real-time comprehension of
speech. Tasks involve measuring the reaction time of subjects in responding to a pre
specified target. Target types which have been used include phonemes (Foss, 1969),
syllables, words and phrases (McNeill & Lindig, 1973), words, rhymes and semantic
categories (Marslen-Wilson and Tyler, 1975) and mispronunciations of single pho-
nemes in words (Cole, 1973).
It is now clear that in the understanding of continuous semantically organised speech,
even monitoring of the first phoneme in a word involves word recognition (Morton and
Long, 1976). We can therefore turn with interest to experiments showing the effects
of rhythmic structure, that is relative timing of sentence stress patterning, on pho
neme monitoring times to real words (Cutler, 1976; Cutler and Foss, 1977) and possi
bly also to similar results for nonsense words (Shields, McHugh and Martin, 1974),
and ask further what effect the actual temporal structure, rather than relative
stress patterning, has on speech perception. It is clear that the system described
above provides the ideal tool for producing the required manipulations in temporal
structure. Furthermore, digital speech processing allows the location of synchroni
sation pulses used for reaction time measurement to be precisely determined; the
precise identity of stimuli, differing only by the experimental manipulation, reduces
the danger of artefactual results through differences in two human versions of the
"same" sentence.
[Fig. 5: transcript of a program interaction. After the command DATA, the user names an OVID data file; the program types the labels of the speech file, the user assigns each labelled section to group C or V, gives ratio operations for the two groups followed by GO, and an optional print-out lists for every label its group, new onset time, old and new durations, and the residual error; the session ends with the name of the new speech file and QUIT.]
Fig. 5. Example of program interaction in modifying the phrase "He laughed and laughed 'til his belly wiggled like jelly." Here consonants are shortened to 3/4 of their length and vowels extended by one third. The silence before "belly" is left unchanged in duration.
Finally, an interesting observation has arisen from selective changes in consonant
and vowel duration (Marcus, 1977). Although "uniform" compression of speech produced
by the above system does not produce vowel reduction (Lindblom, 1963), relative
changes in segment duration (Karlsson and Nord, 1972), nor take account of invariances
in segment duration important in perception (Marcus, in press), it appears that
large variations in linear speech rate can be made with little perceived change in
naturalness or speaker quality. This contrasts sharply with the results of relative
changes in duration of consonant-like and vowel-like segments. If the relative du
ration of "vowels" to "consonants" is increased by 50%, the overall effect is that
of a very lazy, unmotivated speaker. However, if the reverse is performed the effect
is of a stiff, tense, authoritative voice. This occurs even though overall speech
rate, relative timing, amplitude and intonation are the same for the two utterances.
Thus a change in the pragmatic interpretation of an utterance has arisen from fine
grain segmental timing, a result problematic for models in which prosodic infor
mation is seen as extracted from global cues, segmental analysis being only a final
stage preceding segment classification.
Summary
A system based on semi-automatic pitch-synchronous editing of natural speech is
described. It makes experimental manipulations of either local or global speech
rate easy to produce, and the quality of the manipulated speech is of a similar
standard to the original.
References
Cole, R.A. (1973) Listening for mispronunciations: a measure of what we hear during speech, Perception and Psychophysics, 11, p. 153-156.
Cutler, A. (1976) Phoneme monitoring reaction time as a function of preceding intonation contour, Perception and Psychophysics, 20, p. 55-60.
Cutler, A. and Foss, D.J. (1977) On the role of sentence stress in sentence processing, Language and Speech, 20, p. 1-10.
Foss, D.J. (1969) Decision processes during sentence comprehension: effects of lexical item difficulty upon decision times, Journal of Verbal Learning and Verbal Behaviour, 8, p. 457-462.
Karlsson, I. and Nord, L. (1972) Stops and CV segment duration, International Conference of Speech Communication and Processing, Bedford, Mass., Paper F5, p. 210-213, New York, IEEE.
Lindblom, B. (1963) Spectrographic study of vowel reduction, Journal of the Acoustical Society of America, 35, p. 1773-1781.
McNeill, D. and Lindig, K. (1973) The perceptual reality of phonemes, syllables, words and sentences, Journal of Verbal Learning and Verbal Behaviour, 11, p. 419-430.
Marcus, S.M. (1976) OVID - a further tool for speech perception studies, I.P.O. Annual Progress Report, 11, p. 31-33.
Marcus, S.M. (1977) The IPO speech squeezing system, Presented to Sussex meeting of the Institute of Acoustics Speech Group, July 1976.
Marcus, S.M. Distinguishing "slit" and "split" - an invariant timing cue in speech perception, Perception and Psychophysics, in press.
Marslen-Wilson, W.D. and Tyler, L.K. (1975) Processing structure of sentence perception, Nature, 257, p. 784-786.
Morton, J. and Long, J. (1976) Effect of word transitional probability on phoneme identification, Journal of Verbal Learning and Verbal Behaviour, 12, p. 43-51.
Shields, J.L., McHugh, A. and Martin, J.G. (1974) Reaction time to phoneme targets as a function of rhythmic cues in continuous speech, Journal of Experimental Psychology, 102, p. 250-255.
Willems, L.F. and De Jong, Th.A. (1974) Research tools for speech perception studies, I.P.O. Annual Progress Report, 9, p. 77-81.
Visual perception
Spatial processing of small visual stimuli
F.J.J. Blommaert
Introduction
During the past few years much experience has been gained at this laboratory in using
a perturbation technique for gathering information on dynamic properties of the
human visual system (Roufs and Blommaert, 1975; Roufs and Pellegrino, 1976).
In the field of spatial processing of details, Kulikowski and King-Smith (1973)
and Hines (1976) used subliminal summation successfully for measuring line- and
edge spread functions of the retina. They found that the visual system operates
linearly with respect to processing on threshold level of one distinct feature
like a line or an edge. Moreover, line-spread functions measured at eccentricities
of 1.25° and 2.50° indicated that the visual system, on the whole, operates
inhomogeneously. Therefore lines, with their spatial extensiveness, do not seem very
suitable as a probe stimulus in experiments where local properties of spatial pro-
cessing are being investigated. We propose that for this purpose a small point-shaped
stimulus is the obvious means.
In this paper a first experiment is described in which we tried to measure a point
spread function of the visual system. Results are reported of a second experiment
investigating the linearity of spatial processing in a small area of the retina.
Theory
The perturbation technique is based mainly on determining changes in the threshold
value of a point-shaped stimulus due to perturbation of its response caused by the
response of a faint subthreshold stimulus with properly chosen shape.
We take it that detection of quasi-static stimuli can be formalised by using a peak
detection model, i.e. a stimulus is seen if the extreme value of its response
U(x,y) exceeds a certain level D. The threshold condition may then be written as:

extr {U(x,y)} = D
Furthermore, we assume that within a small area of the fovea:
- the retina is homogeneous and circularly symmetrical;
- the processing is quasi-linear;
- the extreme value of the response of an infinitesimally small point source coincides with the coordinates of stimulation.
The response to an arbitrary stimulus may now be written as a convolution integral
with a local point spread function U_0(x,y):

U(x,y) = \iint U_0(x-x',\, y-y')\,\epsilon(x',y')\,dx'\,dy'

Here, \epsilon(x,y) is the distribution of retinal illumination of the stimulus and U(x,y)
is the response of the visual system.
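Purely as a numerical illustration of the convolution above (Python, with a made-up Gaussian point spread function and a small uniform stimulus; neither function is taken from the present measurements), the response at the origin can be approximated on a grid:

    # Numerical evaluation of U(0,0) = sum of U0(-x',-y') * eps(x',y') * dx * dy.
    # U0 (Gaussian) and eps (small uniform disc) are arbitrary stand-ins.
    import numpy as np

    step = 0.01                                  # grid spacing (min of arc)
    x = np.arange(-2.0, 2.0 + step, step)
    xx, yy = np.meshgrid(x, x)
    rr = np.hypot(xx, yy)

    U0 = np.exp(-rr**2)                          # made-up point spread function
    eps = np.where(rr <= 0.2, 1.0, 0.0)          # uniform stimulus of radius 0.2'

    U_origin = np.sum(U0 * eps) * step**2        # U0 is symmetric, so U0(-x,-y) = U0(x,y)
    A_p = np.sum(eps) * step**2                  # area of the stimulus
    approx = A_p * U0[len(x) // 2, len(x) // 2]  # E_p * A_p * U0(0) with E_p = 1
    print(U_origin, approx)

For a stimulus that is small compared with the width of the point spread function the two printed numbers are close, which is the small-stimulus approximation used in eq. 1a below.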
As in this paper we deal only with circularly symmetrical stimuli, it is convenient
to use polar coordinates:

U(r,\phi) = \int_0^{2\pi}\!\int_0^{\infty} r'\,U_0(|\vec{r}-\vec{r}\,'|)\,\epsilon(r',\phi')\,dr'\,d\phi'    (1)

For a small stimulus approximating a point source, with retinal illuminance E_p and
radius r_0, the response pattern becomes

U_p(r,\phi) = E_p \int_0^{2\pi}\!\int_0^{r_0} r'\,U_0(|\vec{r}-\vec{r}\,'|)\,dr'\,d\phi'

Due to the basic assumptions for determining threshold, only the response for r = 0
is of interest, and so

U_p(0) = E_p \int_0^{2\pi}\!\int_0^{r_0} r'\,U_0(r')\,dr'\,d\phi' \approx E_p A_p U_0(0)    (1a)

if r_0 is sufficiently small. Here, A_p is the area of the point.
The threshold condition of a point may now be formulated as

E_p A_p U_0(0) = D    (2)
If perturbation of a point response is applied with an arbitrary shape, eq. 2
changes into

E_{p,pert}\, A_p U_0(0) + U_{pert}(0) = D    (3)

Here U_{pert}(0) is the response in the origin of the perturbating stimulus, and E_{p,pert}
is the retinal illuminance of the point source necessary for detection of the combination.
Of course, eq. 3 is valid only if the retinal illuminance of the perturbation is so
small that detection still takes place by the extremum of the point response.
For the actual experiments we chose perturbation with two different shapes, as shown
in Fig. 1.
In the first experiment we used an annulus with radius r_a and width \Delta r_a. It can easily
be verified from eq. 1a that for \Delta r_a \ll r_a its response in the origin can be approximated
by

U_a(0) \approx E_a A_a U_0(r_a)

where A_a is the area of the annulus. If E_a/E_{p,a} = q, where E_{p,a} is the threshold of the
combination point-annulus, we can write for the threshold condition of the combination

E_{p,a}\,\{A_p U_0(0) + q A_a U_0(r_a)\} = D
[Fig. 1: schematic stimulus profiles (retinal illuminance as a function of position x).]
Fig. 1. Schematic representation of the stimulus configurations used. At the left, a
point-shaped stimulus with perturbating annulus of radius r_a and width \Delta r_a, representing
the collection of all points at distance r_a from the stimulus. At the right, the
point-shaped probe stimulus, together with a disc of radius r_d (a kind of negative-
going edge at distance r_d from the stimulus).
Comparing the thresholds with and without perturbation, E_{p,a} and E_p, we are able to
derive for the normalised point spread function U_0^*(r_a):

U_0^*(r_a) = \frac{A_p}{q A_a}\left\{\frac{E_p}{E_{p,a}} - 1\right\}    (4)

We can again find the absolute response by multiplying U_0^* by its absolute value
U_0(0), expressed in "D"-units, which can be computed from eq. (2). So, by measuring
the thresholds E_p and E_{p,a} we can calculate one discrete value of the point spread
function. By varying the radius r_a of the subthreshold annulus it is possible to find
a number of points for U_0^*(r_a).
The second perturbation shape, a disc with radius r_d, was chosen in order to check
the basic linearity and peak detection assumptions.
It can be seen from eq. 1 that the response pattern of a uniform disc at r = 0 can
be written as (subscript "d" for disc):

U_d(0) = E_d\, 2\pi \int_0^{r_d} r'\,U_0(r')\,dr'

If E_d/E_{p,d} = q, this leads to the threshold condition

E_{p,d}\left\{A_p U_0(0) + q\, 2\pi \int_0^{r_d} r'\,U_0(r')\,dr'\right\} = D

From the threshold change E_p/E_{p,d} it can be derived that

F^*(r_d) \equiv 2\pi \int_0^{r_d} r'\,U_0^*(r')\,dr' = \frac{A_p}{q}\left\{\frac{E_p}{E_{p,d}} - 1\right\}    (5)

By using discs with different values of r_d, we will find a discrete number of samples
for F^*(r_d), which has a definite relationship with the normalised point spread function
U_0^*(r) as expressed in eq. 5.
The experimental results may be interpreted as the central response of the retina to
a disc with increasing radius or almost as the central response of the retina to a
negative-going edge with increasing distance from the origin.
The unique relation between F^*(r_d) and U_0^*(r) expressed in eq. 5 enables us to check
the above-mentioned basic assumptions, such as quasi-linearity and peak detection, in combination.
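Purely by way of illustration, a minimal sketch (Python) of the arithmetic of eqs. 4 and 5; all threshold values and stimulus dimensions below are hypothetical numbers, not data from the experiments reported here:

    # One sample of the normalised point spread function U0*(r_a) from eq. 4,
    # and one of F*(r_d) from eq. 5, computed from (hypothetical) thresholds.
    import math

    def u0_star(E_p, E_pa, A_p, A_a, q):
        """Eq. 4: U0*(r_a) = (A_p / (q * A_a)) * (E_p / E_pa - 1)."""
        return (A_p / (q * A_a)) * (E_p / E_pa - 1.0)

    def f_star(E_p, E_pd, A_p, q):
        """Eq. 5: F*(r_d) = (A_p / q) * (E_p / E_pd - 1)."""
        return (A_p / q) * (E_p / E_pd - 1.0)

    # Hypothetical stimulus geometry (minutes of arc)
    r0, r_a, dr_a = 0.5, 2.0, 0.2            # point radius, annulus radius and width
    A_p = math.pi * r0**2                     # area of the point probe
    A_a = 2.0 * math.pi * r_a * dr_a          # area of the annulus

    # Hypothetical thresholds (arbitrary units) and subthreshold illuminance ratio q
    E_p, E_pa, E_pd, q = 1.00, 0.85, 0.70, 0.1

    print("U0*(r_a) =", u0_star(E_p, E_pa, A_p, A_a, q))
    print("F*(r_d)  =", f_star(E_p, E_pd, A_p, q))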
Methods
The subject viewed an 11° uniform field with a retinal illuminance of E = 1200 troland
monocularly. Both the stimulus and the perturbation shape were projected on top of this
field by using prisms. To facilitate fixation, four fixation lights were projected
around the stimulus in a circle with a radius of 1°. A 2 mm artificial pupil with an
entoptic guiding system was used. The lights were generated by linearised glow
modulators. The time functions used as an approximation of quasi-static presentation
of the stimuli consisted of blocks of about 300 ms the beginning and the end of
which were smoothed to avoid transient phenomena.
The subject had one knob to release the stimulus, which was delayed by a convenient
preset time interval. Three knobs enabled him to answer with "yes", "no" or "reject".
All thresholds were 50% probability values obtained by a modified "method of constant stimuli".
Results
Fig. 2a shows the normalised point spread function for subject FB. The absolute res
ponse, expressed in "D"-units, can be found by multiplying the reduced values by the
extreme value given in the legend under the code "norm. constant". This constant is
obtained by averaging the threshold E_p of the point alone over all 18 sessions of both
experiments. In this way we try to acquire an optimal representation of the sensi-
tivity during the whole experiment (a fuller explanation of the statistical procedure
followed will be given elsewhere). Per session, one point of the curve was measured,
each point being calculated from the average of 8 pair quotients according to eq. 4.
The experimentally determined standard deviation of the mean is indicated. The order
of measurement with respect to the r-axis was randomly chosen.
In Fig. 2b experimental data of F*(rd) according to eq. 5 are shown. They were
measured during 11 sessions, making an average of one point per session. All other
conditions were the same as for the point-spread function experiment. The dashed
curves are results of a simultaneous fitting to all experimental data in such a way
that the relation of the two functions exactly obeys eq. 5. The curves were obtained
by averaging the Fourier components of Fig. 2a and those of dF*/dr_d in Fig. 2b (the
differentiation was carried out in the frequency domain), followed by backward trans-
formation to U_0*(r) and F*(r), respectively.
[Fig. 2a,b: data plots for subject FB, E = 1200 td, norm. const. = 6.17·10⁻³ td⁻¹·min⁻²; horizontal axes r_a and r_d in minutes of arc.]
Fig. 2a and b. Experimental data of a) the normalised point spread function U_0*(r_a). They
may be interpreted as the response of the visual system to a point of unit energy.
b) F*(r_d) of eq. 5. It can be interpreted as the response to a disc at its centre,
with radius r_d of the disc as parameter. The dashed curves are the result of a
simultaneous computer fit. They exactly obey the relation between F*(r_d) and U_0*(r_a)
as dictated by eq. 5.
Discussion
It is clearly seen from Fig. 2 that the responses measured are fairly large in com
parison with the spread. The standard deviation, however, varies considerably between
the different data points as a consequence of changes in q- and A-values for the
various perturbations. From eqs. 4 and 5 it can easily be verified that different
q- and A-values lead to different threshold-spread effects on the experimental results.
Within measuring precision, all experimental points fall close to the dashed curves,
which supports the basic linearity and peak detection postulates. Other provisional
assumptions, such as circular symmetry, were not tested within the experimental design.
The obtained point spread function is found to be in accord with the results of Hines,
Kulikowski and King-Smith on the shape of Uo*(r), local linearity and peak detection.
The extensiveness of our point spread function (or line spread function if computed
from it) is somewhat less (zero-crossing at 2 min. of arc, whereas for measured line
spread functions this is about 3 min). A possible explanation of this difference may
be the circumstance that the line spread functions were obtained at retinal illumi
nances of the background that were much less than 1200 td.
It may also be due to differences in choice of the probe stimulus, i.e. a point
versus a line. Where a point is used as probe stimulus, the influence of retinal
inhomogeneities is reduced to a minimum.
Summary
A perturbation technique was used for measuring a point spread function of the visual
system. In combination with an experiment in which the central response to discs with
increasing radius was determined, basic assumptions such as quasi-linearity and peak
detection were tested and confirmed. The extent of the validity of other provisional
assumptions, like circular symmetry and local homogeneity, is still to be investigated.
References
Hines, M. (1976) Line spread function variation near the fovea, Vision Res., ~,
p. 567-572.
Kulikowski, J.J. and King-Smith, P.E. (1973) Spatial arrangement of line, edge andgrating detectors revealed by subthreshold summation, Vision Res., 2l, p. 1455-1478.
Roufs, J.A.J. and Blommaert, F.J.J. l1975) Pulse and step response of the visualsystem, I.P.O. Annual Progress Report, ~, p. 60-67.
Roufs, J.A.J. and Pellegrino, J.A. (1976) Gain curve of the eye to subliminal sinusoidal modulation of light, I.P.O. Annual Progress Report, ll, p. 56-63.
Visual recognition by dyslectic children: response latencies for letters and words
H. Bouma, Ch.P. Legein and A.L.M. van Rens
Introduction
We have, over a number of years, been carrying out investigations on deficiencies
underlying the poor reading of dyslectic children. The general aim of this research
has been to understand the poor reading in terms of a number of component processes
and mutual relationships between these processes. In particular we have studied
visual word recognition, since this is so obviously poor in dyslectics. One of our
early results was that dyslectic children and normal readers of the same age (control
group) recognize isolated letters with the same high degree of accuracy from brief
(100 msec) presentations. This indicated to us that the weak readers had an adequate
knowledge of letter forms. On the other hand, in the recognition of embedded letters,
dyslectic children scored much lower than control children. This we interpreted as
increased interferences between adjacent letters. Low recognition scores for words
also distinguished the dyslectic group from the control group and, from an analysis
of individual results, it seemed likely that the poor recognition of words was due
to poor recognition of embedded letters, rather than to inadequate word knowledge
(Bouma, Legein and van Rens, 1974, 1977).
Earlier, we had already shown that in normal adult readers, interferences between
adjacent letters limited word recognition, in particular in the parafoveal visual
field determining the horizontal span of vision or visual reading field (Bouma, 1973).
Also among children, both dyslectics and controls, we had observed a substantial de
crease of recognition scores for embedded letters and words (but not isolated letters),
in parafoveal presentation.
More recently, we have concerned ourselves with time factors in visual word recogni
tion. During fluent reading, eye movements, word recognition, activated memory con
tent and comprehension should have close time relationships. A disruption of rela
tive timings could be just as disastrous for ongoing reading as a deficiency in just
one of the separate factors, such as word recognition. In this spirit, we reported
that presentation of long words in two successive parts gave higher correct scores
than simultaneous presentation in spatially separated parts (Bouma, Legein and van
Rens, 1976). The importance of time factors was gathered from response latencies
in the visual recognition of letters and words, which we will report here.
Experiment
The experiment was concerned with the recognition of isolated letters, letters em
bedded between two letters x, and single well-known Dutch words. These stimuli were
presented one at a time for 100 msec either in foveal or in slightly parafoveal
vision (about 1° eccentricity or 4 letter positions from fixation). The subjects
responded by naming the letter or the word seen. An electronic counter was started
at the onset of the stimulus and stopped by a voice switch reacting to the initial
vocalization of the response. By listening to the tapes of the experiment, a number
of artefacts of the voice switch were eliminated. It is difficult to eliminate all
such artefacts (see also the contribution of Schroder to this issue), but we expect
latencies averaged over some 20 responses to be accurate to within 50 msec - slight
differences are less interesting anyway.
Subjects were 20 dyslectic children and 20 control children aged between 10 and 15
years. Both groups were as described in earlier reports to which we also refer for
experimental details (Bouma, Legein and van Rens, 1974, Bouma and Legein, 1977).
Foveal stimuli preceded parafoveal stimuli and for each stimulus position, isolated
letters, embedded letters and words were presented in that order.
Results and discussion
Average correct scores for the six conditions are presented in Table 1. Essentially
the results are similar to our earlier findings and will not be discussed here.
Average response latencies for correct responses are presented in Table Z. The
average differences between the dyslectic group and the control group have a clear
cut pattern: the dyslectic group is about 100 msec slower in correct letter responses
and 200 msec slower in correct word responses.
dyslectic control
Foveal isolated letters 97 96 %
Parafoveal isolated letters 94 96 %
Foveal embedded letters 74 91 %
Parafoveal embedded letters 41 56 %
Foveal words 88 98 %
Parafoveal words 58 77 %
Table 1. Correct response percentages as averaged over about 20 subjects.
dyslectic control
Foveal isolated letters 900 780 ms
Parafoveal isolated letters 840 740 ms
Foveal embedded letters 930 830 ms
Parafoveal embedded letters 1050 920 ms
Foveal words 890 680 ms
Parafoveal words 940 730 ms
Table 2. Response latencies of correct responses as averaged over about 20 subjects (N = 400-800).
For a possible interpretation, we schematically divide the time between stimulus
onset and response into three serial types of process: visual recognition (including
decision between alternatives), phonemic recoding, and the actual speaking (Fig. 1),
and collect arguments for the allocation of the observed time differences to the stages.
Working backwards, we know of no evidence that dyslectic children are slower speakers,
and in a situation where they had to repeat spoken words, even long and difficult
ones, we did not notice much difference between dyslectics and controls. This would
make it less likely that the delays are due to slower speaking. What about the pho
nemic recoding? On the assumption that words require a more complex recoding than
letters, the different delays between letters (100 msec) and words (200 msec) could
perhaps be sought in the phonemic recoding part. However, there are other relevant
observations. Incorrect responses (not shown in Table 2), which involve longer laten-
cies for both groups of subjects, show much greater differences between the groups
than correct responses. We think it unlikely that, once a stimulus has been incorrect-
ly recognized, it would take much longer to produce a phonemic recoding than it
would for correctly recognized stimuli. Thus, if the cause of incorrect recognition
lies in the visual recognition part, the extra delay cannot easily be allocated to
the phonemic recoding part. This leaves us then with the visual recognition part as
a likely candidate for the delays, and the differences between letters and words
should then also be primarily attributed to the recognition (and decision) part.
[Fig. 1: block diagram; boxes labelled visual recognition, phonemic recoding, vocal response and auditory recognition.]
Fig. 1. Block diagram of assumed serial stages in the oral reading of letters and words.
What can we say about the consequences of the observed delays for reading? Since we
know little about timings of component processes in reading, one cannot be definite
on this problem. There is some recent evidence on the time needed for triggering
eye movements in reading, which, in adult readers, indicates that word recognition
in reading takes no more than 200-300 msec (Rayner and McConkie, 1976). This corres-
ponds well to response latencies for visual recognition of single words by adults,
which can be as low as 300-400 msec. Even the normal reading children, however,
remained well above this value. What we can do then is to compare the extra delay
of dyslectic children with the normal recognition time estimates, with normal response
latencies, or with normal durations for eye fixations (150-500 msec).
In order to avoid basing this discussion on averages over subjects alone, Fig. 2
correlates individual latencies for foveal word recognition for the individual
subjects with reading level. The rather wide range of individual differences does
not hide the fact that, for a number of dyslectic children the delays are of the
same order of magnitude as the comparison durations for recognition and eye fixations
mentioned above. The conclusion can only be that such delays are incompatible with
normal reading, and that even with a full adaptation of all other component reading
processes to the delayed recognition, slow reading seems the maximum attainable.
On the other hand, some dyslectic readers have response latencies quite similar to
their normal reading peers and their weak reading should have a different background.
Referring to our earlier work in which we showed a number of functions subnormal in
dyslectics (Bouma, Legein and van Rens, 1975), we can only repeat our statement that
detailed insight into the component processes and their interaction in time is
badly needed.
[Fig. 2: scatter plot of reading level (grades 1-7) against response latency (0-1.5 sec) for foveal words; open circles: control children, filled circles: weak readers.]
Fig. 2. Reading level versus response latency for foveal words. Note the diversity of response latencies, particularly among weak readers.
Finally, let us take a brief glance at some other interesting comparisons in table 2.
Roughly, embedded letters take longer than isolated letters and words, and parafoveal
stimuli take longer than foveal stimuli. This cannot come as much of a surprise, but
it seems relevant to note that explicit recognition of certain embedded letters cannot
be taken to precede word recognition. Explicit recognition includes decision time
between likely candidates, which is unnecessary for component letters in a word recog
nition task (Bouma and Bouwhuis, 1975). Thus the possibility remains fully open that
word recognition proceeds through prior letter activations. It is puzzling that foveal
latencies for isolated letters are so high, in fact higher than parafoveal isolated
letters. Since the foveal isolated letters came first in the session, it could perhaps
be a training effect. However, such an effect was not evident from comparison of the
first and the last responses in the list.
Summary
Response latencies for visually presented letters and words have been measured for 20
weak readers (dyslectics) and 20 normal readers, aged 10-15 years. In their correct responses, many weak readers are consistently slower, the group averages differing by
100 msec for letters and 200 msec for simple words. The extra delay should probably be attributed to the recognition process itself rather than to phonemic recoding or
speaking processes. Since the extra delays are of the order of fixation pause durations, they can be taken as disruptive in reading.
Acknowledgement
Thanks are due to Messrs A. van Vroenhoven and J. Hupperetz, directors of the schools
involved, and to their staffs, for their kind cooperation.
References
Bouma, H. (1973) Visual interference in the parafoveal recognition of initial and final letters of words, Vision Research, 13, p. 767-782.
Bouma, H. and Bouwhuis, D.G. (1975) Word recognition and letter recognition, I.P.O. Annual Progress Report, 10, p. 53-59.
Bouma, H. and Legein, Ch.P. (1977) Foveal and parafoveal recognition of letters and words by dyslectics and by average readers, Neuropsychologia, 15, p. 69-80.
Bouma, H., Legein, Ch.P. and van Rens, A.L.M. (1974) Visual recognition by dyslectic children: a study on letter and word recognition in foveal and parafoveal vision in 20 weak readers and 20 normal readers, I.P.O. Annual Progress Report, 9, p. 104-109.
Bouma, H., Legein, Ch.P. and van Rens, A.L.M. (1975) Visual recognition by dyslectic children: further exploration of letter, word and number recognition in four weak and four normal readers, I.P.O. Annual Progress Report, 10, p. 72-78.
Bouma, H., Legein, Ch.P. and van Rens, A.L.M. (1976) Visual recognition by dyslectic children: spatial and temporal separation of word halves as recognition aids for dyslectic and control children, I.P.O. Annual Progress Report, 11, p. 64-68.
Rayner, K. and McConkie, G.W. (1976) What guides a reader's eye movement? Vision Research, 16, p. 829-837.
Backward masking in a reading-like situation
U.O. Schroder
Introduction
In reading, the eye jumps along a line of text, and each jump (20-60 msec) is
followed by a fixation pause of 100-500 msec. It is assumed that during this pause
the information is extracted from the text.
From visual research a phenomenon is known called backward masking, that is when
a given stimulus is followed after a short time by a second one, the first (test-)
stimulus is harder to detect, or may even go undetected. It is for this effect of
masking that we call the second stimulus "the mask". The effect of backward masking
depends on the kind of stimulus and mask, and the S.O.A. (Stimulus Onset Asynchrony).
It seems that in more complex stimuli/mask paradigms there must be a greater S.O.A.
to enable an "escape" of the test stimulus. For a two-, four- and six-letter string
followed by a pattern mask Zamansky et al. (1971) for instance found a threshold
S.O.A. of 35, 55 and 100 msec respectively, (computed from their Fig. 2, trailing
mask at 40 cd/m²). For a letter array with a circle mask, Averbach and Sperling
(1961) found a degraded performance for S.O.A.s of 100-200 msec.
We asked ourselves if, in a very complex situation like reading or scanning text,
the backward masking would extend to some 150 ms, degrading the incoming information
and thus limiting high speed reading or scanning of text. In this paper we report
on some experiments to establish the extent of backward masking.
Experiments
In the first experiment the stimulus and mask were Dutch three-letter words (fre-
quency of occurrence > 10⁻⁵); the stimulus and the mask were presented at over-
lapping places at one degree left or right of a fixation point. In a second experi-
ment the mask was always the same 32-letter-long sentence consisting of
unpronounceable (nonsense) words.
From both experiments correct scores and response latencies were obtained. Fitting
the response latency data to a normal distribution on both a linear and logarithmic
time scale, the logarithmic time scale gave the best fit. The mean response latency
and corresponding standard deviation of each subject is therefore computed on a lo-
garithmic time base and not, as usual, on a linear one.
It was observed further that the standard deviation was proportional to the mean
response latency; this observation fits well with a logarithmic transform of the
time scale.
The assumed underlying distribution of the response latencies is used to filter
the data to some extent. After a logarithmic transformation, mean and standard
deviation (s) are calculated, then data outside the interval mean ± 1.7 s are
deleted and mean and standard deviation are computed again. We are following this
procedure to exclude response latencies unrelated to what we wish to investigate.
This cleaning up in both theory and practice corresponds to deleting 5% of the data,
a method which was developed with the aid of computer plots of the data.
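By way of illustration, a minimal sketch (Python) of this cleaning procedure, assuming the two-sided ±1.7 s criterion described above; the latency values are hypothetical:

    # Log-transform the latencies, compute mean and standard deviation, discard
    # values outside mean +/- 1.7 s, and recompute both statistics.
    import math

    def trimmed_log_stats(latencies_ms, criterion=1.7):
        logs = [math.log(t) for t in latencies_ms]
        n = len(logs)
        mean = sum(logs) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in logs) / (n - 1))
        kept = [v for v in logs if abs(v - mean) <= criterion * sd]
        m2 = sum(kept) / len(kept)
        s2 = math.sqrt(sum((v - m2) ** 2 for v in kept) / (len(kept) - 1))
        return math.exp(m2), s2              # geometric mean latency (ms), log-domain s.d.

    latencies = [620, 650, 700, 710, 730, 760, 810, 1900]   # hypothetical responses
    print(trimmed_log_stats(latencies))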
Procedure
The stimuli were presented in blocks of 40. Table 1 shows the conditions
used in these blocks; within the blocks conditions varied in random order, the
only difference between the conditions with and without a mask being the presence
or absence of black letters in a masking field.
Condition   S.O.A.   Stimulus   Masking field
1           140 ms   20 ms      200 ms, without a mask
2           140 ms   20 ms      200 ms, with a mask
3           100 ms   20 ms      200 ms, without a mask
4           100 ms   20 ms      200 ms, with a mask

Table 1. The four conditions which were mixed in one experiment.
Results of word-mask experiment
In this experiment seven untrained subjects participated. The results are summarised
in table 2. From seven subjects we obtain seven mean response latencies and seven
standard deviations. Averaging the seven mean response latencies yields one
"averaged mean response latency" and a corresponding standard deviation; averaging
the seven standard deviations yields one "average standard deviation" and a cor
responding standard deviation.
The influence of a three-letter mask at the same place as an earlier presented three
letter stimulus is visible as a decrease in the percent-correct answers and an in
crease in response latency. The extent of this influence depends on the S.O.A.
(Fig. 1); for a S.O.A. of 100 ms the decrease of the percentage correct is 27%, the
increase in response latency of the correct answers being 16% (130 ms delay of the
correct answers).
[Fig. 1: decrease in percent correct (%C) and increase in response latency (RT), both in per cent, as a function of S.O.A. (10-1000 ms), for the word mask and the string mask.]
Fig. 1. The percentage correct score and the response latency as a function of the S.O.A. between stimulus and mask. In the figure, decreases in correct score (%C) and increases in response latency (RT) as compared to the nonmasked condition are shown.
Condition  S.O.A.   Mask  Average correct  Averaged mean response latency  Averaged individual standard deviation
                          score            of the correct responses        of the correct response latencies
1          140 ms   no    57% (14%)        719 ms (237 ms)                 19% (11%)
2          140 ms   yes   50% (20%)        706 ms (226 ms)                 17% (5%)
3          100 ms   no    73% (16%)        722 ms (238 ms)                 17% (8%)
4          100 ms   yes   38% (18%)        848 ms (200 ms)                 19% (6%)

Table 2. Results of three-letter word masking. Numbers between brackets are standard deviations.
Condition  S.O.A.   Mask  Average correct  Averaged mean response latency  Averaged individual standard deviation
                          score            of the correct responses        of the correct response latencies
1          140 ms   no    76% (15%)        646 ms (112 ms)                 16% (4%)
2          140 ms   yes   70% (14%)        649 ms (125 ms)                 15% (3%)
3          100 ms   no    80% (13%)        648 ms (114 ms)                 14% (2%)
4          100 ms   yes   56% (22%)        702 ms (145 ms)                 18% (4%)

Table 3. Results of nonsense string masking. Numbers between brackets are standard deviations.
Results of string-mask experiment
In this experiment 12 trained subjects participated, their results being summarised
in Table 3. When masking with a long sentence consisting of nonwords, a S.O.A. of 140 ms
again seems to have little influence on the perception of the stimulus. An earlier
presentation of the mask (S.O.A. 100 ms) decreases the percentage of correct score
by 22% and increases the response latencies of the correct responses by 8% (52 ms
delay of the correct answers).
The results of the masked/nonmasked comparison at a S.O.A. of 100 ms are presented
in Fig. 2 as an example of the experimental results.
[Fig. 2: bar charts per subject (1-12) at S.O.A. = 100 ms: decrease in percent correct (%C) and increase in response latency (RT).]
Fig. 2. There is a great variability between the results of subjects for both the decrease in percent correct score (%C) and the increase in response latency (RT).
Discussion
Influence of stimulus onset asynchrony
From these experiments it can be concluded that a mask presented at a S.O.A. of
140 ms after the stimulus does not really affect the perception of the stimulus; at
a S.O.A. of 100 ms there is marked degrading of the perception of the stimulus.
This holds good for both a three-letter word mask and a long-sentence mask consisting
of nonsense words.
Differences between the two experiments
The differences between the two experiments were:
a) the amount of training of the subjects;
b) the length of, and amount of information in the mask.
The trained subjects of the string mask experiment show a higher correct score in
the no-mask condition (78% versus 65%).
Differences in mask/no-mask condition were roughly equal in the experiments (27%
and 22% difference in correct score, 16% and 8% in response time for a S.O.A. of
100 ms), but the results of the string mask were slightly "better". Perhaps the
subjects had difficulty in distinguishing between the poorly visible stimulus word
and the readily visible mask word.
Relevance to reading
In reading, only 9% of the fixation durations are below 140 ms, and on the basis
of the present results the distribution starts at a level where the influence of
masking in our experiments has already ended. However, in reading, the retinal
image keeps on changing, whereas in our experiment we had only one stimulus and one mask.
Summary
In a pattern backward masking paradigm a word of three letters is presented to sub-
jects, followed some time later by a mask consisting of either a three-letter word
at the same retinal place, or an overlapping, long string of nonsense words. In both
cases there is not much masking at a S.O.A. of 140 ms, but a S.O.A. of 100 ms de-
creases the correct-score percentage by 25% and increases the correct response la-
tencies by 8% to 16%.
As far as can be judged from these preliminary data, masking effects are already
terminated at the point where the distribution of fixation durations starts.
References
Andriessen, J.J. and De Voogd, A.H. (1973) Analysis of the eye-movement patterns in silent reading, I.P.O. Annual Progress Report, 8, p. 29-34.
Averbach, E. and Sperling, G. (1961) Short-term storage of information in vision, In: 4th London Symposium on Information Theory, Cherry, C. (Ed.), Butterworths, London.
Zamansky, H.S., Scharf, B. and Brightbill, R.F. (1971) Backward and forward masking as a function of number of letters, interstimulus interval and luminance, Journ. of Exp. Psychol., 2, p. 235-241.
Letter cancellation in words and nonwords
H. Timmers
Introduction
Corcoran (1966) introduced a technique for the study of the reading process that re
quired subjects to search for a target letter, namely the letter e, in a coherent
piece of prose. One of Corcoran's interesting findings was the higher omission rate
for silent e's than for pronounced e's, although the actual pronunciation had no
effect. To explain this result Corcoran suggested that the acoustic image of the
words plays an important role in visual scanning of text. Recently, this acoustic
factor explanation has been scrutinized by some researchers (Schindler and Jacobs,
1976; Smith and Groat, 1977). Schindler and Jacobs suggested that the importance of
the word containing the target letter to the meaning of the sentence might be corre
lated with the probability of missing that target letter. Their data didn't support
this 'importance' hypothesis, but they did find a significant correlation between the
relative frequency of a word in printed text and the number of subjects who missed
the target letter in the word concerned. This result is in agreement with a) the
study by Smith and Groat (1977) who also found more misses in high frequency words
and b) the finding of Corcoran (1966) that the e in the word,the had the greatest
p-robability of being undetected. The experiments by Smith and Groat didn't reveal an
acoustic effect, that is no more omission errors were made for silent e's than for
pronounced e's.
Interestingly enough, recent eye-movement data (O'Regan, 1977) show that "in exactly
the same preceding context where no syntactic predictions are possible, a low-fre
quency verb will attract more fixations than the article the" and "in context where
the word the is highly predictable from the preceding structure of a sentence, it will
rarely be fixated". These data show an interesting correspondence with the result
from the target letter studies suggesting that familiar words and familiar letter
clusters are readily recognised without the detailed processing required to identify
the target letter as a separate unit. On the basis of these studies it can be derived
that the predictability of a word in a sentence and of letters in a word may be an
important determinant of visual scanning of reading material. Consequently, we were
interested to investigate the effects of 1) words versus nonwords in a letter search
task; 2) the position of the target in a word and nonword and 3) the reading level
of the subjects in this task.
Experiment I
Methods
Reading material of two different types was used:
a. a page of 105 words; there were seven words to a line and 15 lines to the page.
Words were separated by a blank space. The words consisted of eight letters. The
letter e had to be marked; there were 32 target letters to the page quite homo
geneously distributed over the letter positions. The number of targets per line
was varied from zero to four. At the beginning of each line the letter e was
printed which had to be underlined in order to monitor the subjects for the correct line.
b. a page of 105 nonwords consisting of scrambled letters. Again there were eight letters
to a nonword, seven nonwords to a line and 15 lines to the page. The distribution
of targets over the letter positions was a little less homogeneous than on the
page of words. Each subject received the two pages, separated by at least one day.
The subject was given the instruction to mark all the e's he saw on each line.
Thirty-six elementary school children served as subjects, six from each grade
(1-6). The age of the children varied from 6 to 12 years. Across subjects from
the same grade, order of presentation of the two conditions (words vs. nonwords)
was counterbalanced. Reading level of all subjects was individually assessed by
means of the Tanghe reading test. The times taken to perform the task were measured.
Results
Fig. 1 shows the error scores averaged over each grade level for words and nonwords.
Subjects seem to miss more target letters in words than in nonwords (Sign test,
p < .01). This difference is the most substantial for the fourth and sixth grade. For
the first, second and fifth grade this effect is less pronounced; for the third it is
even completely absent. From the fourth grade onwards the absolute level of errors
clearly starts decreasing in the word condition; in the nonword condition this trend
already starts in the second grade.

[Fig. 1: histograms of the percentage of errors (0-32%) per grade (1-6) for words and nonwords.]
Fig. 1. The percentage of errors averaged over each grade for words and nonwords.
In Fig. 2 the percentages of misses are representedfor both conditions averaged over all grades as
a function of the position of the target letter
in the word/nonword. For words the omission rate
is lowest at the beginning (first position) and
at the middle (fifth and sixth positions); for
nonwords such a trend isn't observed and here
the errors are more randomly distributed over the target positions. Only first-grade
subjects don't show this particular effect. We were surprised by the huge error score
at the end of nonwords, but closer inspection of the data showed that this can be
considered as an artefact. On the test page one of the two nonwords, in which the
target is at the eighth position, is followed by a nonword in which a target letter
is present at the first position. Twenty-two of the 36 subjects missed this target
at the eighth position, presumably because their attention was already attracted by
the next letter e. If we correct the scores for this effect, the percentage of misses
is reduced from 31.9 to 5.0. The time subjects spent in performing the test decreases
from 5 min. for the first grade to 2.75 min. for the sixth grade. There is no sys
tematic difference between the times spent on the page of words and that spent on the
page of nonwords; with the exception of the first-grade subjects, who spent nearly 5 min. on nonwords and nearly 4 min. on words.
[Fig. 2: histograms of the percentage of misses (0-32%) for target positions 1-8 in words and nonwords.]
Fig. 2. The percentage of misses averaged over all grades for each of the eight target positions in words and nonwords. The double-hatched bar at the eighth position in nonwords represents the uncorrected score.
Experiment II
In a subsequent experiment the same test
was repeated with a greater number of
subjects in the beginning of the first
class of the elementary school. After
the previous experiment which had a
more exploratory character, we wanted
to ascertain more accurately whether
subjects just starting to learn to read
already process words and nonwords dif
ferently. Another point of interest was
the position effect which was clearly
revealed by the other grades but not
found in the first grade.
Methods
The same reading material and the same procedure were used as in experiment I.
Two groups of first-grade subjects, 21 and 23 respectively, took part in the experiment.
Results
[Fig. 3: histograms of the percentage of errors (0-32%) for target positions 1-8 in words and nonwords, first-grade subjects.]
Fig. 3. The percentage of errors averaged over 44 first-grade subjects for each target position in words and nonwords.
Fig. 3 represents histograms of error
scores as a function of target position
for words and nonwords averaged over the
two groups of subjects. Clearly the results
do not show a systematic position effect
for words. The data suggest that for nonwords the variation of the error scores
due to the target position is less than
for words, but this isn't a statistically
reliable effect. No different error scores
averaged over target positions are ob
tained for words compared to nonwords
(Sign test, p = .37). A strong practice
effect, however, was shown by most of
these subjects: the second session
yielded a substantially lower error rate
than the first (Sign test, p <.01).
Discussion
Our findings may be summarised as follows:
1. more target letters are missed in words than in nonwords by subjects from the second grade on;
2. a target-position effect is obtained in words but not in nonwords; only the first grade doesn't show this effect.
Subjects do not tend to spend more time on nonwords than on words so that the higher
omission rate for words can't be explained by a speed-accuracy trade off. Even if
the subjects are told to check their marks it will hardly change their error pattern. This simply suggests that in those cases people do not see the target letter when
reading words. Processing nonwords, however, requires the reader to read out the
letter strings that are presented.
The position effect we find in the word data suggests a characteristic of the eye
movements. It seems as if the eight-letter words require two fixations by the rea
der: one at the beginning and one at the middle of the word. This seems to imply that Corcoran's finding, that the later the e appeared in a word the more likely
it was to remain uncancelled, will depend on word length.
The fact that the first-grade subjects don't show the word/nonword effect and the
target-position effect, means that this letter search task is sensitive to the
change of reading ability and reading processes that occur on the way from first
to second grade.
Subjects from both the first and second grade weren't able to recognise the eight
letter words that were presented. So it may be concluded that the second-grade
children did process the words on the basis of familiar letter-cluster units.
Further study will be necessary to reveal more detailed aspects of the relation
between reading ability and identification of single letters.
Summary
When young subjects (children from 6 to 12 years of age) search for a target letter
in eight-letter words and eight-letter nonwords, they make more omission errors in
the word condition. As far as words are concerned, their error patterns show a
position effect, that is, fewer errors are made at the beginning and the middle of
the words. First-grade children do not show these effects.
References
Corcoran, D.W.J. (1966) An acoustic factor in letter cancellation, Nature, 210, p. 658.
O'Regan, K. (1977) Moment to moment control of eye saccades as a function of textual parameters in reading, Paper presented at the "Processing of Visible Language" conference, Eindhoven.
Schindler, R.M. and Jacobs, P.I. (1976) What do we see when we read? Paper presented at the Annual Meeting of the Eastern Psychological Association, New York City.
Smith, P.T. and Groat, A. (1977) Spelling patterns, letter cancellation and the processing of text, Paper presented at the "Processing of Visible Language" conference, Eindhoven.
Processing of visible language: a Symposium
H. Bouma, D.G. Bouwhuis and H. Timmers
An international symposium on "Processing of Visible Language" was held by IPO in
Eindhoven from 5 - 8 September 1977. The idea for such a meeting had its origin in
contacts between Paul A. Kolers (Toronto) and IPO, and was further worked out by an
organizing committee, including Anthony Cohen (Utrecht), Merald E. Wrolstad (Cleve
land, Ohio), and the authors of the present contribution. There were about 50 active
participants from North America and Western Europe, and 36 papers were presented at
six sessions. Prof.Dr. H.B.G. Casimir addressed the audience in the opening session.
The primary aim of the symposium was to establish contacts between three separate
groups who have a professional interest in the perception and production of visible
languages: a) investigators of human reading, b) graphic designers, c) display
engineers. In our times, when new types of reading develop under the influence of
new technologies, the time seemed fit for a concerted reflection on current and
future ways of presenting visible information.
Consequently, sessions were concerned with three themes: a) research on reading
processes b) production of graphic language c) technological display systems.
Insight into human reading has been notably advanced by recent research on pro
cesses that are supposedly involved in the reading of text. Such research has tra
ditionally concerned normal print on paper and, to the extent that electronic text
displays have more recently been used, this has been for reasons of experimental
convenience only. It seemed wise to use this accumulated body of work as a founda
tion for the symposium, which has a wider interest. There were sessions on "the
Control of Eye Movements in Reading", "Letter and Word Recognition", "Sentence and
Text Recognition" and "Reading and Listening". The last-mentioned aimed at eluci
dating the possible role of an auditory (speech) component in the processing of
text in contrast to other visible languages such as in mathematical formulae and
cartography. It was decided to leave out many related topics, notable ones
being the broad areas of psycholinguistics and of reading education.
A session on graphic languages dealt with the plurality of forms and formats in the
art of presenting visible language. Graphic languages have deep historic roots and
in the course of their development there have been gradual adaptations to the
technologies involved such as carving, writing, printing, photographing, and to the
skills of the readership. The scientific approach to perceptual processes in reading
is of course a much younger one. A combination of arts and sciences in the domain
of producing visible language would offer prospects for more explicit and rapid
adaptation, yet leave ample room for creative expression.
A session on technological media concentrated on options (such as interactive use)
and drawbacks (such as limited text quality) of various electronic displays, which
are rapidly supplementing print on paper for both the professional and the general
reader. Most activities in this field are very recent and are geared primarily to
convenient technology and minimum cost and less to optimum communication.
The notion that the areas of 'reading research', 'graphics', and 'displays' were
separate, was borne out by the symposium. Reading processes are so fascinating to
study, that there is a trend towards sub-areas for specialists with specific methods
and paradigms, in which there is a danger of losing a clear-cut relation to normal
reading. There appeared to be a need to re-establish such a relationship and, for
the understanding of actual reading as it occurs in daily life, to let relevance
be a dominant factor in the evaluation of results and in the definition of problems
for research. How, otherwise, could such research be digested by the interested
graphic designers and display engineers? If the researcher is perhaps in a good
strategic position, the graphic designer is optimally suited to find creative
solutions for graphic communication. The display engineer is gradually developing
an awareness of the problems in defining the human interface, but, so far, seems
to make too strong demands on the adaptability of the professional users in the
absence of a suitable forum which could define the general user as well.
The cleft between the different approaches became evident too in many discussions
where designers lost track of the relevance of highly debated issues on the psycho
logy of reading or where psychologists were at a loss to provide answers to such
seemingly simple practical questions as desired text quality or lay-out. Discussions
as well as personal contacts may, we hope, inspire creative thinking on the relation
ship between practical questions and tangible research problems.
Considering the widely different backgrounds and aims of the participants the results
of the symposium can only be evaluated in the long term. However, it has already
become clear that there are pressing problems in the presentation of text and other
visual information and that the research on reading processes has great potential for
the definition of basic issues, the provision of methods and even answers. For one
thing, the actual reading that people engage in is the real issue and research should
reflect any shifts in reading habits. So far, the 'practical' producers of visible
language are left with little choice other than to work largely on implicit guidance.
The years ahead should show whether explicit guidance will be added to that and a
possible symposium held in a few years' time may provide an interim answer.
The proceedings of the symposium will appear in 1978 under the editorship of
Paul A. Kolers, Merald E. Wrolstad and Herman Bouma.
Papers presented at the Symposium:
Reading processes
The control of eye movements in reading
Ariane Levy-Schoen and Kevin O'Regan (Paris)
The control of eye movements in reading (tutorial)
Keith Rayner (Rochester, N.Y.)
Eye movements in reading: eye guidance and visual integration.
George McConkie (Ithaca, N.Y.)
The role and control of eye movements in reading.
Kevin O'Regan (Paris)
Moment-to-moment control of eye saccades as a function of textual parameters in reading.
Dennis Fisher (Aberdeen, Maryland)
Understanding the reading process through the use of transformed typography: PSF, CSG and automaticity.
Wayne Shebilske (Charlottesville, Virginia)
Reading eye movements, macro-structure and goal-processing.
Letter and word recognition
Alan Allport (Reading)
Word recognition in reading (tutorial)
Alexander Pollatsek and Thomas Carr (Amherst, Mass.)
Wholistic and rule-governed encoding processes in word perception.
Richard Venezky (Newark; Delaware)
Orthographic regularities in English words.
John Marshall (Nijmegen)
Implications of the acquired dyslexias for the study of normal reading.
Philip Smith and Anne Groat (Stirling)
Spelling patterns, letter cancellation and the processing of text.
Don Bouwhuis (Eindhoven)
Letter recognition and word knowledge as determinants of word recognition.
Lester Lefton (Columbia, Carolina)
Peripheral information processing in reading.
Sentence and text recognition
Willem Levelt (Nijmegen)
A review of sentence perception research (tutorial)
Alan Baddeley (Cambridge)
Working memory and reading.
Mogens Jansen (Copenhagen)
Relations between the qualifications of different groups of readers and different aspects of given text.
John Merritt (Milton Keynes)
Contexts, concepts and reading outcomes.
L.J. Chapman (Milton Keynes)
The perception of language cohesion during fluent reading.
Patricia Wright (Cambridge)
When two no's nearly make a yes: a study of conditional imperatives.
Anthony Pugh (Leeds)
Styles and strategies in adult silent reading.
Reading and listening
Dominic Massaro (Madison, Wisconsin)
Reading and listening (tutorial).
Uta Frith (London)
Reading by eye and writing by ear.
Shulamit Reich (London)
How to make reading more like listening: a study of the Stroop effect in sentences.
Lee Brooks (Hamilton, Ontario)
A comparison of implicit and explicit alphabets.
John Morton (Cambridge)
Some experiments on facilitation in word and picture recognition and their relevance for the evolution of a theoretical position.
Graphic languages
Michael Twyman (Reading)
A schema for the study of graphic language (tutorial).
Wim Crouwel (Delft)
Typography, a technique for making a text 'legible'.
Anthony Marcel (Cambridge)
Paragraphs or pictographs: the use of non-verbal instructions for equipment.
Robert Waller (Milton Keynes)
Typographic access structures for educational texts.
Jeremy Foster (Manchester)
The use of visual cues in text.
Richard Phillips (London)
Making maps easy to read - a summary of research.
Technological media
Robert Rosenthal (Holmdel, N.J.)
The design of technological displays (tutorial).
Neil Wiseman (Cambridge)
Non-serial language
Willem Hoekstra (Geldrop)
Electronic paperwork processing in the office of the future.
Arthur Eger (Delft)
Computer-aided design of graphic symbols and an alphabet.
Richard Jackson (Salfords)
Television text: first experiences with a new medium.
The visibility song
Bouwhuis/Hogue
Sometimes I feel so lonely, sometimes I feel so blue.
Sometimes my heart's near breaking, my eyes have lost their view.
Now I woke up this morning, I hate to face the day.
But when I read a book that's when my sorrow goes away.
I process language, visible language, that graphic language,
till the still small voice comes through.
Cognition
Towards an analysis of dialogue organization principles
H.C. Bunt
Introduction
This paper describes some of our recent research on fundamental aspects of dialogues,
focussing on the establishment of a framework for analysing the organizational principles in goal-directed dialogues. The aim of this research is twofold: on the one hand to provide insight, useful in the design of man-machine dialogues; on the other
we hope to obtain more insight into human language processing by studying the
communicative effects of utterances in a dialogue.
Our research builds upon work in two different areas: Artificial Intelligence (A.I.) and linguistics.
A.I. studies of dialogues that should be mentioned here include those of Grosz
(Grosz, 1977; Walker et al., 1976), Mann et al. (Mann, 1977), and Winograd et al.
(Winograd, 1977, Bobrow et al., 1977).
Grosz has studied 'task-oriented dialogs' in which a computer guides a person in
performing a certain task. Particular attention was paid to the establishment of
mechanisms for resolving definite noun phrases and interpreting elliptic expressions.
In the studies carried out by Mann et al. and by Winograd et al., dialogues on
limited subject domains are viewed as instances of discourse patterns specific to
the subject domain in question. There would be one pattern for travel-information
dialogues, another for client-waiter dialogues in a restaurant, etc. The machine
interpretation of an utterance then comes down to fitting the utterance into such
a pattern. Patterns of this kind are called 'schemas' (Winograd) or 'dialogue games'
(Mann), and are similar in character to data structures used in other A.I. work,
known as 'scripts' (Schank, 1977) or 'frames' (Minsky, 1975).
A limitation of this approach is that it tends to focus on domain-specific aspects
of dialogues, paying too little attention to important general principles of linguistic communication.
The linguistic study of dialogues belongs to that part of linguistics called pragmatics, the theory of language use (see e.g. Groenendijk and Stokhof, 1977; Haberland and Mey, 1977).
Grice (1975) has formulated some very general principles for cooperative linguistic behaviour. From the operation of these principles Grice deduces the important notion of conversational implicature. Conversational implicatures are presuppositions which a hearer must make in order to interpret a speaker's behaviour as being in accordance with the general principles. For instance, if one Dutchman asks another: "Where does Jan live?", and the answer is "Jan lives in Europe", it is conversationally implied that the answerer does not know in what country Jan lives (else he would violate the principle that one should be maximally informative). Grice's conversational principles are very general and vague, and need to be elaborated in order to be of practical use.
Utterances in dialogues very often contain expressions that refer explicitly to the
(linguistic or extralinguistic) context. Examples are personal pronouns (I, you, it), demonstratives (that), time and place indications (now, before, here), etc. Important
work on the interpretation of such expressions has been done by Hausser (1976, 1977).
He has developed a treatment in terms of an extended Montague grammar, in which
sentences from a discourse are interpreted with respect to a formal context model. The
most important extension of this framework, compared to standard Montague grammar,
is that such a context model involves a so-called stage description which, roughly
speaking, is a sequence of utterances representing previous discourse with indications
of their semantic interpretation.
The best developed approach to the study of language as a functional system for
communication, is speech act theory. Speech act theory approaches the use of language
as a case of rule-governed, intentional performance of actions: 'speech acts', and
it deals explicitly with such things as speaker- and hearer-intentions involved in
the performance of such acts (see also Bunt, 1976, p.98). One of the main points of
interest is the study of how utterances function in discourse: whether they assert,
question, promise, threaten, warn, etc. This functional aspect of an utterance is
called its illocutionary force. The 'content' of an utterance: that which is asserted,
asked, promised etc., is called its propositional content (Searle, 1969).
Speech act theory is still at an informal stage of development, and there is as yet
no generally accepted system of illocutionary forces. Searle (1969) analyses a
number of illocutionary forces in terms of the intentions and other aspects of the
mental state of speaker and hearer; Allwood (1976) gives a more refined analysis of
the factors that influence linguistic communication.
A framework for dialogue analysis
Linguistic and A.I. studies of dialogues have so far mainly resulted in a number of
interesting data and observations. One of our aims is to establish a general frame
work in terms of which the data and observations can be interpreted and explained
and which can lead to a more systematic exploration of dialogues.
Since there is a great diversity in kinds of dialogues, varying from apparently chaotic
brainstorm dialogues to extremely rigid superior-subordinate instruction dialogues,
it is desirable to choose a certain dialogue genre on which to concentrate. We have
picked out informative dialogues, by which we mean dialogues for the purpose of exchanging factual information about a well-defined subject domain, while at least
one of the participants knows exactly which information is to be exchanged.
In the present discussion we restrict ourselves in particular to informative dialogues in which someone (A) wants to obtain certain information and thinks that someone else (B) might possess this information. Assuming that B is willing to supply the information, if he has it, A initiates a dialogue with B.
To understand what happens in such a dialogue, we should first of all recognize that
A's saying something to B is intended to have an effect on B's state of knowledge,
as the result of B's understanding of what A said. Subsequently, B will react on the
basis of his understanding. Such a reaction may be an observable reaction, such as an answer or an acknowledgement, or a non-observable reaction such as believing what A said or correcting a wrong presupposition about A. What reaction is solicited will
depend on B's state of knowledge and the changes brought about in it by A's utterance.
We must, therefore, consider states of knowledge and the effects of utterances on them.
The total state of knowledge of an intelligent language user is so complex, that
it seems entirely hopeless to characterize it exhaustively. Fortunately, this is not
necessary: only certain properties of knowledge states need to be specified. To
establish the most relevant kinds of properties, let us take a closer look at various
aspects of a knowledge state, and how these aspects are involved in informative dialogues. First, it can be observed that not all the knowledge in a knowledge state is
affected by utterances in such a dialogue. An informative dialogue is about a certain
subject domain; knowledge that is unrelated to this subject domain and to the communication situation will be unaffected. For instance, knowledge of the language and how to use it, and general world knowledge, indispensable for being a competent language user, will remain unaffected. Other types of knowledge, such as knowledge of the other participant's aims in participating in the dialogue, do change in the course of the conversation.
Our approach is now as follows. We first try to separate those types of knowledge that
change during informative dialogues from those that remain constant. We make some
general assumptions about the latter, and then study the way in which various types
of the first kind operate dynamically in informative dialogues.
Dynamic factors
In the above characterisation of the kind of informative dialogues we consider here,
we already mentioned three types of knowledge that play a crucial role: the goal of
the dialogue-initiator A, his knowledge of contingent facts in the domain of discourse,
and his presuppositions about B's knowledge of such facts. We label these types of knowledge
as A-GOAL, A-FACTS, and A.B-FACTS, respectively. In order to open the dialogue, A
must also have developed, at least partially, a plan of action for achieving his
goal. We designate A's plan by A-PLAN. We consider the nature of plans below.
A-GOAL, A-FACTS, A.B-FACTS and A-PLAN are factors that we must assume to be among
A's knowledge at the beginning of the dialogue. As soon as the dialogue develops, the
other participant B builds up the same types of knowledge. We have here for each
participant four dynamic knowledge factors of obvious importance. We will call the
dynamic part of the total knowledge state of a dialogue participant A his K-state, designated by K_A, and the various factors in a K-state we call K-factors.*
A closer look at dialogues reveals that, besides the four K-factors mentioned above,
there are several others. For two dialogue participants to act cooperatively, they
* The terminology used here is perhaps not very satisfying. It is somewhat odd to call one's goals and plans 'knowledge'; in Bunt (1978) the term conversational information is used. Our use of the term 'presupposition' is also disputable. We use this term to refer to one's assumptions about goals, plans etc. of the dialogue partner; the reason for calling these assumptions presuppositions is that they typically must be made in order to interpret what is being said as making sense. The related issue of the degree of certainty of various types of knowledge is not discussed here.
must have some knowledge of each other's goals and plans. So we must also include in K_A: A's presuppositions about B's goal, designated by A.B-GOAL, and A's presuppositions about B's plan, designated by A.B-PLAN. Similarly for K_B.
This brings us to six K-factors for each participant: GOAL, PLAN and FACTS, and presuppositions about these factors in the other participant's K-state.
There is one more type of K-factor that we want to introduce here, the relevance of which is illustrated by the following dialogue sample:
(1) A: Do you know what the capital of the Netherlands is?
(2) B: Yes, Amsterdam.
(3) A: That's right.
(4a) B: Why do you ask?
(4b) B: I don't believe you thought I didn't know that!
In sentence (4b) we can identify a part (I) that refers explicitly to B's factual knowledge (B-FACTS): "I didn't know that"; a part (II) that refers to A's presuppositions about B-FACTS (i.e. A.B-FACTS): "you thought I didn't know that"; and the complete sentence refers to B's presuppositions about A.B-FACTS, i.e. to presuppositions about presuppositions.
In general, we must assume the existence of presuppositions about presuppositions of
the other participant. For instance, in order to correct a presupposition about
oneself, which is attributed to someone else, one must first make the presupposition
that the other participant is making the wrong presupposition. We therefore add to our six K-factors the presuppositions of one more level of indirection, and designate these (for A's K-state) by A.B.A-GOAL, A.B.A-FACTS, and A.B.A-PLAN.
In principle, one could go on indefinitely adding presuppositions of higher levels
of indirection, but from an empirical point of view it appears that we can stop here,
given our restriction to informative dialogues. Utterances like "I think that you
think that I think that you don't know this", which would correspond to one level
higher than that in A.B.A-FACTS, are almost unintelligible, and do not occur in informative dialogues.
We therefore conclude, tentatively, that a K-state can be characterized by the nine
factors distinguished so far.
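For concreteness, a minimal sketch of how such a K-state might be represented as a data structure; the Python names below are our own illustrative choices, not part of any implemented system. All K-factors except the plans are modelled as sets of propositions; plans are treated as simple lists of actions here and are discussed further below.

    # Sketch (hypothetical names): a K-state as nine K-factors, from A's side.
    from dataclasses import dataclass, field

    Proposition = str  # e.g. "Jan lives in Europe"

    @dataclass
    class KState:
        goal: set = field(default_factory=set)       # A-GOAL
        facts: set = field(default_factory=set)      # A-FACTS
        plan: list = field(default_factory=list)     # A-PLAN
        p_goal: set = field(default_factory=set)     # A.B-GOAL
        p_facts: set = field(default_factory=set)    # A.B-FACTS
        p_plan: list = field(default_factory=list)   # A.B-PLAN
        pp_goal: set = field(default_factory=set)    # A.B.A-GOAL
        pp_facts: set = field(default_factory=set)   # A.B.A-FACTS
        pp_plan: list = field(default_factory=list)  # A.B.A-PLAN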
This is not to say, of course, that these K-factors represent all the knowledge
that influences the course of a dialogue. Other factors are for instance:
(1) knowledge of the language, and how it is conventionally used in dialogues;
(2) presuppositions about the other participant's knowledge of the language;
(3) familiarity with the domain of discourse;
(4) knowledge of the social status of the other participant.
We deal with these factors in the following way. First, we assume that they are constant throughout the dialogue. Second, we assume that
(1) both participants have full command of the language;
(2) both participants presuppose that the other has full command of the language;
(3) the participants are quite (and equally) familiar with the domain of discourse;
(4) the social status of the participants (is such that it) does not give rise to restrictions on the communication.
For all other kinds of knowledge (such as: "He always exaggerates", "He's a big liar") we assume that similar conditions are met (such as: "He's not intentionally misleading me"), such that they have no distorting effect on the communication. A dialogue occurring under these conditions will be said to occur under ideal conditions,
or simply to be an ideal dialogue.
The notion of an ideal dialogue is of course an abstraction; hopefully a fruitful
one, that allows us to concentrate on the most essential factors involved in informative dialogues.
Summarizing our discussion of knowledge factors:
The basic assumption is that language is used functionally in dialogues in order to
bring about certain changes in the other participant's state of knowledge. A number
of knowledge factors have been identified that may be changed by utterances in informative dialogues. These factors, called K-factors, make up the dynamic part of a
state of knowledge, called K-state. Other knowledge factors that may influence the
course of a dialogue are assumed to have 'neutral' values and to remain constant
throughout the dialogue; a dialogue to which this assumption applies is called an
ideal dialogue.
The nature of plans
With the exception of a plan, all K-factors can be thought of as sets of propositions.
A-GOAL can be thought of as the set of propositions that A wants to be true at the
end of the dialogue; A-FACTS as the set of propositions representing contingent
facts in the domain of discourse, that A believes to be true; A.B-FACTS as the set
of propositions about B-FACTS that A believes to be true, etc. But a plan cannot
be thought of as a set of propositions. In the simplest case a plan is a sequence
of actions, leading to the desired goal. In general, a plan contains several alternative sequences of actions, the choice of an alternative being determined by the
outcome of actions earlier in the plan.
As an example of such a plan, consider the following part of a plan that is programmed into a computer system that supplies information about train departure
times on the basis of a simple dialogue, in which the system determines what information is desired:
1. Explain to the user which directions he can choose
2. Ask the user which direction he is interested in
If his answer cannot be recognized, go back to 1, but do this no more than three times.
If his answer is recognized as "R", then
3. Confirm: "direction R"
4. Ask the user whether he wants to travel today.
If the answer is yes, then
5. Ask whether he wants to know the next departure times.
If the answer is yes, then
6. Provide today's next departure times in the direction R.
If the answer is no, then
7. Ask what time he wants to leave approximately, etc.
Such a plan is conveniently represented by a labelled graph as follows:
Fig. 1. Graphical representation of a plan.
The expression 'max. 3' at the loops in this graph indicates that the loop should be taken at most 3 times. The plan represented here is in fact part of the plan used in the experimental dialogue system described elsewhere in this volume (Muller, Nooteboom, Willems, 1977). Note that every path in Fig. 1, leading from the top node to a terminal node, contains a sequence of actions leading to the desired goal. Each of these actions has a goal of its own, a subgoal relative to the overall goal, and the fulfilment of all subgoals along such a path is equivalent to the fulfilment of the overall goal.
The plans that people use in dialogues are usually not completely elaborated, like the plan represented in Fig. 1. Rather, these plans have the form: "I'll do this first; then, if he does such, I'll do so; if he does something else, I'll do something to make sure that so and so". Only part of such a plan is elaborated; another part does not specify the actions to be taken, but only a subgoal to be achieved. We call such a plan an incomplete plan.
An important property of the kind of plans we consider here is that they begin with an action. We will refer to the subgoal of this action as the immediate goal of the participant in question. We designate the immediate goal of participant A by A-ImGOAL.
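As an illustration, the plan fragment just described can be written down as a labelled graph in the style of Fig. 1. The encoding below is our own sketch, not the dialogue system's actual program: each node is an action with a subgoal of its own, edges are labelled by the outcome of the action, and the loop carries its 'max. 3' bound.

    # Hypothetical graph encoding of the plan fragment of Fig. 1.
    plan_graph = {
        "explain_directions":   {"done": "ask_direction"},
        "ask_direction":        {"recognized R": "confirm_direction",
                                 "not recognized": "explain_directions"},  # max. 3 times
        "confirm_direction":    {"done": "ask_travel_today"},
        "ask_travel_today":     {"yes": "ask_next_departures"},
        "ask_next_departures":  {"yes": "give_departure_times",
                                 "no": "ask_departure_time"},
        "give_departure_times": {},  # terminal node: overall goal fulfilled
        "ask_departure_time":   {},  # the plan continues beyond this fragment
    }

Every path from "explain_directions" to a terminal node is a sequence of actions whose subgoals jointly fulfil the overall goal; the subgoal of the first action is the system's immediate goal (ImGOAL) in the sense just defined.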
Turn-taking
We ultimately hope to describe how dialogue utterances function, in terms of their effects on K-states. We therefore have to indicate for each of the K-factors at which point in the dialogue we are considering them. We do this by indicating at what turn in the dialogue they occur. Turn-taking in natural dialogues is a complicated phenomenon (see Sacks et al., 1974), but in order to simplify matters we will assume that speakers do not speak simultaneously, etc.
In that case the following notation will do. Let A and B be the participants in a dialogue, and let A be the one who initiates the dialogue by a sequence u^(0) of utterances. Let K_A^(0) and K_B^(0) designate A's and B's K-states at the beginning of the dialogue.
As the result of A's utterances u^(0), B's K-state changes from K_B^(0) to a new state K_B^(1). Next it is B's turn; let u^(1) be the sequence of B's utterances. As the result of these utterances, A's K-state changes from K_A^(0) to a new state K_A^(1), etc., so that the following picture emerges.
Fig. 2. K-states at successive turns.
We will designate the K-factors in a K-state at turn i by the index i, so in K_A^(i) we have A-GOAL_i, A-FACTS_i, etc. This description of the dynamics of K-states is still too crude for our purposes, because we want to consider the changes in K-state that result from a single utterance within a turn. This can easily be done by adding intermediate K-states K^(i,j) between the states K^(i) and K^(i+1); however, in order to avoid the notational complexities that arise from this addition we will simply assume here that at each turn, the dialogue participants make only one utterance.
Question-Answer sequences
The conceptual framework so far developed is intended to serve as a basis for analyzing principles of dialogue organization, the point of issue being that changes in the state of knowledge of a listener can be described in terms of changes in K-factors which are systematically related to properties of dialogue utterances. The establishment of a system of rules capturing this relationship would be of particular interest in the design of man-machine communication systems. In this section we give an impression of what can be done in this direction on the basis of our conceptual framework, by considering some fundamental aspects of question-answer sequences in ideal informative dialogues.
In a dialogue, each participant interprets the utterances of the other participant
and generates new utterances. In the interpretation of utterances it is useful to
distinguish at least the following two aspects:
(i) the determination of what is being communicated;
(ii) the determination why this is being communicated.
For instance, the interpretation of the utterance "John bought a new car yesterday"
involves (i) the recognition that one is being told that John bought a new car
yesterday, (ii) the realization that the speaker presupposed that he was telling
something not already known to the listener.
In the generation of utterances we can distinguish at least the following aspects:
(i) the setting up of a new (sub)goal to be achieved next in the dialogue;
(ii) the formation of a plan for achieving that goal.
Let us now look at the interpretation of a question and the generation of an answer
in terms of changing K-factors. General observations on linguistic communication
that have been made by Grice, Searle and others (see the introductory section of
this paper), can often be given a more precise formulation in these terms (see
Bunt, 1978 for more details).
Let the dialogue participant A pose, at stage i in the dialogue, the question u^(i) to B, with the propositional content p. For instance, u^(i) = "Did John buy a new car?" with p = "John bought a new car." B's interpretation of u^(i) as a question means that he recognizes that A's immediate goal is to obtain the information described by p. In other words, B.A-ImGOAL_{i+1} will be that p belongs to A-FACTS_{i+1} (which was also A-ImGOAL_i; this corresponds to Searle's sincerity condition for questions).
B's realization as to why A posed the question u^(i) involves (under ideal conditions) the following changes in B's K-state:
- B presupposes that A didn't have the information p himself, i.e. "p does not belong to A-FACTS_i" belongs to B.A-FACTS_{i+1};
- B recognizes that A would only ask B the question u^(i) if A supposed that B might have the information p, i.e. '"p does not belong to B-FACTS_i" does not belong to A.B-FACTS_i' belongs to B.A.B-FACTS_{i+1};
- B recognizes that A would only ask B for the information p if A did not suppose that B was just about to supply p anyway, i.e. '"it is not the topmost action in B-PLAN_i to tell p" belongs to A.B-PLAN_i' belongs to B.A.B-PLAN_{i+1}.
The first and third of these points correspond to Searle's preparatory conditions for
questions; the second point appears to be something that Searle has overlooked.
On the basis of his interpretation of u^(i), B will generate a response. An important part of B's interpretation of u^(i) was the recognition of A's immediate goal. If correctly recognized, A's immediate goal is what B thinks it is, i.e.

    B.A-ImGOAL_{i+1} = A-ImGOAL_i

and this goal is that p belongs to A-FACTS_{i+1}. Under ideal conditions, B takes A's question seriously and attempts to behave cooperatively. Optimal cooperative behaviour will follow if B adopts A's presupposed immediate goal as his own goal, i.e.

    B-GOAL_{i+1} = B.A-ImGOAL_{i+1}

Subsequently, a plan can be formed to achieve this goal. The simplest possible plan consists of simply giving the information p to A, provided the information p is available in B-FACTS_{i+1}. This would be an utterance with the illocutionary force of an assertion; the condition of availability is one of Searle's preparatory conditions for making an assertion, as well as a special case of Grice's 'principle of quality'. The simple rule governing the generation of a response that seems to operate here is: if B-GOAL_{i+1} is that p belongs to A-FACTS_{i+1}, and p belongs to B-FACTS_{i+1}, then B-PLAN_{i+1} is the single action of telling p to A. Execution of this plan is answering A's question.
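The rule just stated is easily made concrete. The sketch below is our own illustration, reusing the hypothetical KState of the earlier sketch for B's side, so that 'facts' stands for B-FACTS, 'goal' for B-GOAL and 'plan' for B-PLAN; it is not the dialogue system's implementation.

    # Hypothetical sketch of the response-generation rule for a question
    # with propositional content p.
    def respond_to_question(b, p):
        b.goal.add(p)              # B-GOAL(i+1) := B.A-ImGOAL(i+1): A is to know p
        if p in b.facts:           # availability condition (Searle, Grice)
            b.plan = [("tell", p)] # B-PLAN(i+1): the single action of telling p
        # otherwise another plan must be formed, e.g. correcting a wrong
        # presupposition; this lies outside the simple rule sketched here

    b = KState(facts={"the capital of the Netherlands is Amsterdam"})
    respond_to_question(b, "the capital of the Netherlands is Amsterdam")
    assert b.plan == [("tell", "the capital of the Netherlands is Amsterdam")]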
Of course there are many other kinds of replies that B can sensibly give to A's question, depending on B's K-state. For instance, if B thinks that the question contains a wrong presupposition, he may be expected to correct this.
We hope that it will be possible to develop a system of rules within the present
framework comprehensive enough to be of interest as a model for human informative
dialogues and to guide the operation of a machine engaging in an informative dialogue
with a human user.
Summary
'Informative dialogues', i.e. dialogues with the aim of exchanging certain well-defined information, are studied for their organizational principles. Basic to this study is the assumption that language is used in a functional way in order to influence the other participant's state of knowledge. A number of knowledge-state factors have been identified that are subject to changes during an informative dialogue. These factors, which include a goal, a (possibly incomplete) plan, knowledge of the domain of discourse, and various types of presuppositions about the other participant, make up the variable part of a knowledge state, called K-state. Other knowledge factors that might influence the course of a dialogue are assumed to have no distorting effect on the communication and to remain unchanged throughout the dialogue; a dialogue to which this assumption applies is said to occur under 'ideal conditions'. The aim of this study is to develop a system of rules for the generation and interpretation of utterances in informative dialogues occurring under ideal conditions, by relating properties of utterances to properties of K-states. As an illustration, a simple rule has been sketched for giving the complete direct answer to a question.
References
Allwood, J. (1976) Linguistic communication as action and cooperation: a study in pragmatics. Gothenburg Monographs in Linguistics, 2.
Bobrow, D.G., Kaplan, R.M., Kay, M., Norman, D.A., Thompson, H. and Winograd, T. (1977) GUS, a frame-driven dialog system. Artificial Intelligence, 8, p. 155-173.
Bunt, H.C. (1976) Some recent developments in semantics. I.P.O. Annual Progress Report, 11, p. 94-107.
Bunt, H.C. (1978) Dialogue analysis and speech act theory. In: Papers from the IVth Scandinavian Conference of Linguistics, Middelfart (Denmark).
Clark, H.H. and Clark, E.V. (1977) Psychology and language. Harcourt Brace Jovanovich, New York.
Grice, H.P. (1975) Logic and conversation. In: P. Cole and J.L. Morgan (Eds.), Syntax and semantics, vol. 3, Speech acts. Academic Press, New York.
Grosz, B.J. (1977) The representation and use of focus in dialogue understanding. Ph.D. Thesis, Berkeley (California).
Groenendijk, J. and Stokhof, M. (1977) Semantics and pragmatics, a theory of meaning. Paper presented at the Conference on empirical and methodological foundations of semantic theories for natural languages, Nijmegen, March 14-18.
Haberland, H. and Mey, J.L. (1977) Linguistics and pragmatics. Journal of Pragmatics, 1, p. 1-11.
Hausser, R.R. and Zaefferer, D. (1976) Questions and answers in a context-dependent Montague grammar. In: Proceedings from the Bad Homburg workshop on formal semantics.
Hausser, R.R. (1977) The semantics of mood. Paper presented at the Symposium on Speech Acts and Pragmatics, Dobogókő (Hungary), September 5-8.
Mann, W.C., Moore, J.A. and Levin, J.A. (1977) A comprehension model for human dialogue. Proc. 5th Int. Joint Conference on Artificial Intelligence, Boston (Mass.), August 22-25.
Minsky, M.L. (1975) A framework for representing knowledge. In: P. Winston (Ed.), The psychology of computer vision. McGraw-Hill, New York.
Muller, H.F., Nooteboom, S.G. and Willems, L.F. (1977) An experimental system for man-machine communication by means of speech, this issue.
Sacks, H., Schegloff, E. and Jefferson, G. (1974) A simplest systematics for the organisation of turn-taking for conversation. Language, 50, p. 696-735.
Schank, R.C. and Abelson, R.P. (1977) Scripts, plans, goals and understanding. L. Erlbaum, Hillsdale (N.J.).
Searle, J.R. (1969) Speech acts. Cambridge University Press, Cambridge (U.K.).
Walker, D.E. (1976) Speech understanding research, final technical report. Stanford Research Institute, Menlo Park (California).
Winograd, T. (1977) A framework for understanding discourse. A.I. Memo, Stanford University.
Knowledge of Dutch three-letter words
Rectification
In the previous issue of the Annual Progress Report the following article contained
some serious printing errors:
Knowledge of Dutch three-letter words
J.C. Jacobs, A.L.M. van Rens and D.G. Bouwhuis
IPO Annual Progress Report, 1976, 11, p. 77-84.
On page 78 of this article table 1 gives the probabilities of letters appearing in
the three positions of Dutch three-letter words. This table contains four printing
errors and should be replaced by the one printed below.
1st letter  %       2nd letter  %       3rd letter  %
b           7.3     e           21.0    t           11.1
d           6.6     o           20.1    l           10.1
p           6.5     a           18.5    s            9.8
l           6.5     i           14.3    k            9.7
k           6.3     u           13.5    n            7.2
h           6.3     r            3.5    p            6.3
r           6.2     l            2.7    f            5.6
t           5.5     n            1.4    m            5.5
m           5.5     k            0.6    e            5.5
a           5.0     d            0.6    g            4.6
w           4.3     t            0.4    r            4.2
o           4.2     s            0.4    i            3.6
n           4.1     p            0.4    a            3.2
g           3.8     h            0.4    d            3.1
e           3.8     g            0.4    b            2.5
v           3.2     c            0.4    u            2.4
s           3.2     b            0.4    o            2.0
z           2.8     m            0.3    x            1.5
j           2.8     j            0.3    h            0.8
f           2.5     w            0.1    w            0.7
i           1.3     v            0.1    c            0.4
u           1.1     f            0.1    z            0.1
c           1.1     z            0.0    y            0.0
q           0.1     y            0.0    v            0.0
y           0.0     x            0.0    q            0.0
x           0.0     q            0.0    j            0.0

Table 1. Letter distributions of initial, middle and final letters of the 713 real Dutch words.
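As an aside (our own illustrative sketch, not part of the original article), positional letter probabilities of this kind can be combined to score candidate three-letter strings, under the simplifying assumption that the three positions are independent:

    # Hypothetical use of Table 1: score a three-letter string by the product
    # of its positional letter probabilities (excerpts of the table, in %).
    p1 = {"b": 7.3, "k": 6.3, "t": 5.5}     # 1st-letter column
    p2 = {"e": 21.0, "o": 20.1, "a": 18.5}  # 2nd-letter column
    p3 = {"t": 11.1, "k": 9.7, "n": 7.2}    # 3rd-letter column

    def score(word):
        a, b, c = word
        return p1.get(a, 0.0) * p2.get(b, 0.0) * p3.get(c, 0.0)

    print(score("bak"))  # 7.3 * 18.5 * 9.7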
Ergonomics and perceptual aids
Legibility of rectilinear digits
H. Bouma and F.L. van Nes
Introduction
Over the last decade, electronics technology has advanced very rapidly, leading to
great decreases in size and price of all sorts of information-processing equipment.
Consequent changes in controls and displays present opportunities as well as challenges to ergonomics. Controls, on a calculator or oscilloscope, for instance, tend
to get more numerous and smaller, and thus more difficult to handle. On displays,
the presentation of analog information is being superseded by representations using
letters and digits. These symbols in that case generally consist of a number of parts,
either dots, arranged in a matrix, or line segments. The same parts thus figure in
different symbols. The shape of the latter, and of the display device itself, are
mainly chosen on technical and economic grounds. It then remains to be seen how the
resulting text or numbers compare with what people are still most used to: letters
and digits which are not divided into parts, and printed in black on white paper.
As yet, little attention has been given to the question of how well people can read
text, numbers or other symbols generated on electronic displays. In our institute
some research on the legibility of matrix letters for CRT screens was done (Bouma
and Leopold, 1969; Bouma and van Rens, 1971), and the present paper will present
data on the legibility of the rectilinear, segmented digits now appearing on most
displays for numerical information. Such digits coexisted for some time with digits
containing curvilinear parts (Fig. 1), as successors to the conventionally shaped
numbers displayed by Nixie tubes. Nowadays, (almost) completely rectilinear digits,
like those in row d of Fig. 1, appear to have gained.
Fig. 1. Examples of digit shapes, built from curved and straight line segments. Row d shows the shapes investigated in the experiment reported.
When judging the legibility of symbols, 3 factors should be taken into account: visibility, discriminability and acceptability. For line-segment digits, this leads to the following considerations:
visibility, or identifiability, depends on contrast between the digit on display and its background. It can be obtained by selecting a suitable combination of line-segment dimensions and luminances, and by avoiding reflections on the display from other light sources;
discriminability, or individuality of different digits, is important, even more so than with letters, because there is generally no redundancy in numbers, as opposed to words. Line-segment digits may be prone to confusion owing to their similarity of shape; therefore this factor needs special attention;
the acceptability of symbols depends on their correspondence to readers' concepts, in this case, of digit shapes. A high acceptability, i.e. a reasonable degree of resemblance to those concepts, should be aimed at in displays used by the general public, like watches or calculators. Differences between the shapes of handwritten digits in different countries, for instance in the digits 1, 7 and 8, may reflect a problem in this area. Professional users of digit displays may be satisfied with somewhat more unusual digit shapes, as long as these possess a high individuality.
Our experiments were mainly concerned with discriminability. We chose perceptually difficult conditions in order to obtain a sufficient number of errors for analysis. The results of this analysis led to suggestions for improved digit shapes. The improvements aim in the first place at increasing digit discriminability, and in the second place at improving their acceptability. The present paper is a brief report of the experiments, which will be published elsewhere in more detail.
Method
Two observational conditions were used:
1. reading or recognizing luminous single digits, of the type shown in the lowest row of Fig. 1, at a distance of 16 m. The height of the digits, 19 mm, then corresponds to a visual angle of 4.1 minutes of arc. At this distance, about 60% of the presented digits was correctly recognised.
2. parafoveal reading at a normal distance, i.e. 57 cm, in the right visual field. Both isolated digits and three-digit numbers were presented, in two separate groups, at an excentricity which also led to an average recognition score of about 60%. This meant excentricities of 30 degrees for the single digits, and 5, 7½ and 10 degrees for the hundreds, tens and units, respectively, of the three-digit numbers. In these parafoveal conditions the stimuli were presented for 100 ms, to eliminate the possibility of foveal stimulus projection through eye movements.
Ten subjects participated in all parts of the experiment. Each digit was presented
10 times, in random order, in the separate parts.
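As a check on the distance condition (a back-of-envelope calculation of our own), the visual angle of a digit of height h = 19 mm seen at distance d = 16 m is

    angle = arctan(h/d) ≈ h/d = 0.019/16 ≈ 1.19 × 10^-3 rad ≈ 0.068° ≈ 4.1 minutes of arc,

in agreement with the value quoted under condition 1.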
Results
The average scores of correctly recognised digits by the subjects were:
- for distance reading: 64%
- for excentric reading of single digits: 65%
- for excentric reading of numbers: 81%, 37% and 53% for hundreds, tens and units, respectively.
Fig. 2 shows the recognition percentages for each digit, averaged over the 3 parts of the experiment. Considering averages is justified because the different observational conditions yield similar results. A comparison of Fig. 2 with the way in which the digits are made up leads to the assumption that a relation exists between a digit's correct score and the number of segments it counts. In Fig. 3, therefore, the percentage of correct responses is shown as a function of the number of line segments of the digits concerned. The point for 5 line segments represents the average correct score of the digits 2, 3 and 5, because these are all made up of 5 segments. Likewise, the point for six line segments represents the average correct score of 0, 6 and 9. The other points in Fig. 3 are each related to one digit. The figure demonstrates clearly that correct scores decrease with the number of segments which digits count.

Fig. 2 (left) and Fig. 3 (right). Percentages of correctly recognised presentations, shown at the left for each digit (averaged over all observational conditions) and at the right as a function of the number of line segments of the digits concerned. The points at five and six segments each represent averaged percentages from 3 digits (2, 3, 5 and 0, 6, 9, respectively).
Fig. 4. The percentage of confusions as a function of the total number of difference segments between the pairs of digits concerned, averaged over digit pairs and observational conditions. See text for an explanation of "difference segments".
Subjects are quite willing to make guesses about the identity of digits which they see more or less vaguely. The response "illegible", though explicitly allowed, was given for only 1% of all stimulus presentations. This means that the incorrect responses almost all consist of confusions between digits. It was found that considerable differences exist between the frequencies of all the possible confusions; in other words, they occur systematically. Analysis of the system underlying the confusions may yield information on the perceptual processes which occur during digit recognition. Fig. 4 shows that an inverse relation obtains between the percentage of confusions - averaged over all observational conditions and all digits - and the total number of "difference segments". This total is the sum of the number of line segments in which the digits concerned differ, by addition as well as omission. For example, the total number of difference segments for the digits 4 and 7 is three. The important conclusion that can be drawn from Fig. 4 is that the probability of confusion with another digit diminishes strongly as that digit differs by the greater number of segments from the digit actually presented.
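The notion of difference segments is easily made explicit. In the sketch below (our own illustration; the seven segments are labelled A-G in the conventional way, which need not coincide with the lettering of Fig. 5), each digit is a set of segments and the number of difference segments of a pair is the size of the symmetric difference of the two sets:

    # Hypothetical sketch: seven-segment encodings and difference segments.
    SEGMENTS = {
        0: set("ABCDEF"),  1: set("BC"),     2: set("ABDEG"),  3: set("ABCDG"),
        4: set("BCFG"),    5: set("ACDFG"),  6: set("ACDEFG"), 7: set("ABC"),
        8: set("ABCDEFG"), 9: set("ABCDFG"),
    }

    def difference_segments(d1, d2):
        # segments present in one digit but not in the other
        # (additions as well as omissions)
        return len(SEGMENTS[d1] ^ SEGMENTS[d2])

    print(difference_segments(4, 7))  # 3, as in the example above
    print(difference_segments(8, 9))  # 1: such pairs are confused most often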
Confusions between digits are generally not symmetrical. It turned out, for instance, that at a distance of 16 m the presented digit 8 was fourteen times read as "9", whereas the presented digit 9 was only read three times as "8". In this case, as well as in many others, the perceived digit more often contains fewer segments than the digit presented than the other way round: a kind of simplification tendency in perception.
Summarising, the results can be described with 3 general rules:
1. the smaller the number of segments from which a digit is built up, the better it is recognised;
2. the larger the total number of segments in which two digits differ, the less the
probability that they will be confused;
3. a digit which is not correctly recognised, is more often perceived as one with a
structure simpler than that of the presented digit than the other way round.
Discussion
Fig. 4 suggests that the number of difference segments plays an important role in recognising rectilinear digits. Consequently, it is interesting to investigate whether all segments are equally prominent in perception; i.e., whether they all carry the same perceptive weight. This will mainly depend on the distinctive function attributable to the separate segments. If, for instance, a common segment occurred in all digits - which is not so - its perception would be of hardly any importance. The distinctive role of the seven line segments has been determined for all digit pairs on theoretical grounds, i.e. as regards the contribution of individual line segments to the difference in shape between the members of a pair. Here we shall only consider those pairs which differ in one line segment. The analysis of pairs with 2 differing segments is analogous, and does not lead to different conclusions (Bouma and van Rens, 1971).
Fig. 5a depicts all digit pairs differing only in one segment, written inside the segments concerned. It turns out that segments C and F each have a distinctive function for 2 digit pairs, and that segments D and E have no distinctive function for pairs with just one difference segment.

Fig. 5a. Digit pairs which differ in only one segment, depicted in that segment. This shows that the segments C and F are important in distinguishing between different digits.
Fig. 5b. Number of confusions between the digit pairs from Fig. 5a, written in the corresponding segments. The numbers 38 and 54 each are the sums of confusions between the two digit pairs depicted in segments C and F, respectively.
Obviously these considerations are of limited value if they do not correlate with the perceptive significance of the separate segments. This significance should, for segments A, B and G, be higher than for D and E, but lower than for C and F. It is very simple to check if such relationships hold, by counting the number of confusions between the digit pairs concerned. The result of this operation is shown in Fig. 5b: in each segment the number of confusions is shown between the digit pair(s) from the corresponding segment of Fig. 5a. The agreement between these numbers of confusions and the related numbers of digit pairs (0, 1 or 2) is good. When this analysis is extended so as to include digit pairs differing in 2 segments, a correlation coefficient of +0.97 is found between the distinctive role of the segments and the number of confusions between corresponding digit pairs (Bouma and van Rens, 1971).
Consequences in the design of rectilinear digit shapes
Can we use the experimental results now to obtain a better design for line-segment digit shapes? As stated in the introduction, by "better" we mean in the first place a higher discriminability of digits. Since the number of confusions increases sharply with a decrease in difference segments δ, as demonstrated by Fig. 4, we seek to avoid low values of δ, especially δ=1. The upper half of Fig. 6 shows the digit shapes investigated, whereas in the lower half a number of alternatives are drawn, with the aim of increasing the distinction between digit pairs with δ=1. The following points may be considered:
1. changes in shape for /6/ and /9/. This would lead to δ=2 instead of δ=1, for 5 of the 7 pairs from Fig. 5a. One new pair with δ=1 would result, however, viz. /4/ and /9/. Still, the number of pairs with δ=1 would be 4 less, whereas the acceptability of the /6/ and /9/ would probably not suffer from the change.
2. a changed position of /1/ in the segment network, viz. from the right to the left vertical line segments. Especially in numbers with more than one digit, this would enlarge the distinction between /1/ and /7/, which pair would then formally get δ=5 instead of δ=1.
3. another way to increase δ for the digit pair /1/ and /7/ is to change the shape of the /7/ as drawn in Fig. 6. Indeed, δ=2 results for this pair; however, for the pair /7/ and /9/ δ decreases from 2 to 1. So no overall improvement is obtained. Also, it remains to be seen whether the lower /7/ in Fig. 6 is as acceptable as the higher one.
4. Similar objections can be raised to the changed digit /0/ in Fig. 6. The lower shape will hardly be found to be acceptable, and would create a new pair with δ=1: /0/ and (the new) /6/, which had δ=3. The motive for the change in shape of the /0/ would be an increase from δ=1 to δ=3 for the pair /0/ and /8/.
So far, we have only considered omissions and additions of line segments, retaining
their original shape. It is also possible to design improved digit shapes by accentuating their perceptually important line segments. Possibilities include making such segments a bit broader, say 50%, or, in the case of horizontal segments, lengthening them somewhat, e.g. 30%. It can be shown that the acceptability of such digit shapes has not decreased. An increase in acceptability may be obtained by increasing the slant of the digits to 20°. This has the added advantage of increasing the distinction between line-segment numbers, as a group, and capitals.
Fig. 6. Upper row: the digit shapes investigated; lower row: possible alternatives for 5 of the 10 digits. The alternatives originate from the experimental data.
In conclusion, we arrived at a new design for rectilinear digit shapes as shown in
Fig. 7. Only after an experiment with a display incorporating this design, under the
same observational conditions as in the present experiment, could the effect of the
proposed changes on discriminability and acceptability be evaluated. Unfortunately,
such an experiment was not possible because no displays based on the new design have
been constructed.
Fig. 7. Proposal for improved seven-segment digit shapes. The improvements relate in the first place to the discriminability of the digits, and only in the second place to their acceptability.
Summary
Technical and economic developments have led to a rapidly increasing use of new media
for text presentation. Little is known about the legibility of the letter- and number
shapes which are used for such electronic media. This paper deals with research on
the recognisability of numbers built up from straight line segments, which therefore
have a schematic form. Erroneous recognition of such numbers leads to confusion between them. The distinctive function of the individual line segments has been determined from the errors. This analysis leads to improved design of the number shapes. First, the improvements aim at increasing the discriminability of the numbers; second, improvement of their acceptability, i.e. resemblance to the traditional number shapes, plays a role.
Acknowledgement
The authors are greatly indebted to A.L.M. van Rens, for his important contribution in designing the experiment and elaborating the data from it.
References
Bouma, H. and Leopold, F.F. (1969) A set of matrix characters in a special 7 x 8 array. I.P.O. Annual Progress Report, 4, p. 115-119.
Bouma, H. and van Rens, A.L.M. (1970) Cijferherkenning bij een indicatorbuis met 7 lijnsegmenten. I.P.O. Rapport no. 179.
Bouma, H. and van Rens, A.L.M. (1971) Completion of an alphanumeric matrix display with lower-case letters. I.P.O. Annual Progress Report, 6, p. 91-94.
A typewriter for a motorically handicapped person, operated by head movements
P.H. van der Heijden*, H. Bouma, H.E.M. Melotte and F. Meyer**
Introduction
A typewriter has been made for a motorically handicapped person deprived of the
normal use of his arms and hands. It enables him to display his text on a screen
before it is typed out on a printer.
The handicapped person used the machine intensively and successfully for over a year
both to communicate with others and to keep himself creatively occupied.
Since it was to be expected from the beginning that the patient would retain for a
relatively long time the ability to move his head, the search for a solution was
concentrated on that possibility. The main idea was to fix a lamp to the patient's
brow, so that by movements of the head he would be able to direct a beam of light
onto a "keyboard" with photosensitive cells (see Fig. 1).
Fig. 1. The user operates the typewriter by directing the light beam from a lampfixed to his brow upon a "keyboard" with photocells. Other operating functions werelater added to this panel.
Members of Philips Research Laboratories*** and of the Institute for Perception
* user of the apparatus
** Philips Research Laboratories, Eindhoven
*** A.H.T. Sanders and H.A.J. Sanders provided the electronics
Research formed a team to cooperate on the implementation of this project. It turned
out that the idea was already known in the literature (Soede, Stassen, van Lunteren
and Luitse, 1973) and there were even a number of prototypes in existence in the Netherlands. However, since it was not possible to get hold of such a prototype in
the short term and postponement was not acceptable, the team decided, in order to
save time and cut out lengthy development work, to build a similar apparatus them
selves, based on a slightly modified design.
After a week of preparatory discussions, work on the construction of the apparatus was started on 23 September 1976, and on 22 October 1976 the equipment was installed
in the user's home. A commercially available electronic typewriter is used, com
prising a memory and display screen (Superbee), coupled to a printer. The signals
are given with a small lamp fixed to the user's brow, which directs a beam of visible
light on to a panel with photosensitive cells. Via an "interface" the "strokes" on
the cells are passed through to the typewriter and displayed on the screen. In this
arrangement there is full access to the letter, numeral and character facilities of
the Superbee display screen. The built-in memory provides ample possibilities for
altering and correcting the text before print-out.
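The selection principle of the photocell panel is a simple dwell-time rule: a cell only produces a "stroke" after the beam has rested on it for about half a second, as described in the evaluation that follows. The actual interface was built in hardware; the sketch below is a hypothetical software analogue, with parameter names of our own, intended only to make the rule explicit.

    # Hypothetical software analogue of the photocell panel's dwell-time rule.
    DWELL = 0.5   # required dwell time in seconds

    def keystrokes(samples, dt=0.05):
        # samples: cell id hit by the beam (or None), observed every dt seconds
        out, current, held = [], None, 0.0
        for cell in samples:
            if cell is not None and cell == current:
                held += dt
                if held >= DWELL:
                    out.append(cell)            # stroke passed to the typewriter
                    current, held = None, 0.0   # beam must leave the cell first
            else:
                current, held = cell, 0.0 if cell is None else dt
        return out

    # beam rests on 'e' for 0.6 s, wanders off, then rests on 'n' for 0.5 s
    print(keystrokes(["e"]*12 + [None]*3 + ["n"]*10))   # ['e', 'n']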
Evaluation
The user himself regularly wrote an evaluation (in Dutch), a selection from which is given in translation below.
The user has the apparatus set up in his study beside a pegboard for letters and
newspaper articles and a page-turning device, which he can also operate from his
wheelchair (see Fig. 2).
Fig. 2. The equipment set up in the user's study in accordance with his personal wishes.
24 October 1976
First general impressions
I have had the apparatus at home now for just over two days. The day before yesterday
I worked on it from 12 to 3 in the afternoon, and then from 7 to 10 o'clock in the
evening (first flush of enthusiasm!). Yesterday I had a break, and today I have again
worked on it from 4 to 6 and this evening from 7 to 11 o'clock. This proves that I
have so far had no trouble from fatigue. On the contrary, the activity has something
peaceful about it: you sit quietly and set down your thoughts at a calm pace. This
obviously has something to do with the satisfaction of being able, for the first
time in months, to put something down on paper without outside help, and also with
the fact that this is real "activity", as opposed to the other, passive pursuits such
as reading and watching television.
So far the head movements have not been noticeably tiring, and directing the light
beam on to the cells is no problem: it is possible to keep the beam in position for
the required half second and then to move on to the next letter. The pace is calm
and regular, the clocking of the relays reminding one of the ticking of a Friesian
clock.
After two days the number of wrong strokes has dropped to a very acceptable level
(one error in two lines I now regard as rather a poor score, due to carelessness
and/or tiredness). This makes me wonder whether it would be possible to shorten
the exposure time without affecting the present result.
3 November 1976
My findings after twelve days are still just as favourable. Fatigue is not a signi
ficant factor. Last Saturday I even worked four hours in one stretch and this ob
viously has much to do with the tremendous stimulus of being able to take up contact
again, to take my own initiatives. In short, a tonic. The ache in the neck I spoke
about some days ago has gone, without there having been any need to change the
position of the panel. It seems I am attached to this set-up.
2 December 1976
Experience with the typing is unchanged after six weeks. The number of mistakes I
make is small, one in every ten lines. There are hardly ever any wrong letters or
numbers; almost always the error is forgetting a space between two words and "new
line" after the break sign. It should be remembered that the typing speed is very
leisurely, much more so than normal typing. It is also possible that I am unconsciously influenced by the certainty of being able to make corrections in good time:
mistakes can be put right.
13 March 1977
In essence, what I reported earlier is still fully applicable after five months: I use
the apparatus very regularly, at least a couple of hours a day, and I find it rela
tively easy. An important point is, of course, that for me this is the only means of
expressing myself in writing (and since speech is becoming more difficult for me,
written communication is in any case receiving more emphasis). By this I mean that
an outsider, accustomed to communicating without difficulties in writing and by word
of mouth, probably has a mistaken idea about what is easy and not easy, since he
will always be inclined to compare the apparatus with a pen or a normal typewriter.
For me this comparison is meaningless and therefore I never make it; all that matters
to me is whether, given my circumstances, I can do with this machine the work I want
to do in the time available to me, and my conclusion is then that this is completely
the case.
6 September 1977
After more than ten months of use there is no reason for me to revise what I have
said earlier. This implies that working with the machine is still for me the ideal
solution, and that I am unable to indicate any way in which the apparatus might be
improved - apart from the points I mentioned earlier, but none of which affect the
principle. Compared with the earlier situation, my head is directed more downwards
during operation. The complaints have tended to increase since then, and my head is
much more often inclined to droop. This tendency seems to run just about parallel
to my general physical and above all mental condition at a particular moment. The
result of this is remarkable: sometimes I find it completely impossible to direct
the beam onto the letters, but when I am in a good condition I am still able, fortunately, to work with the apparatus for hours on end.
The purpose of this evaluation is therefore rather to draw attention to certain
developments in my disease, insofar as they relate to the use of the machine. In the
past period, for example, I have noticed a distinct weakening of the throat and neck
muscles, and speech - already rather a problem since May 1976 - has now become prac
tically impossible.
It was only after considerable hesitation that I decided to use the machine as an
instrument for speech as well as for written communication. Up to last June I was
still able to speak a little, although it was tiring and also painful. After that
my speech rapidly deteriorated and even my wife was often no longer able to understand
me. This brought a sense of growing isolation, with all its associated problems and
frustrations. Even so, it still seemed as if breaking through this by resorting to
the machine implied an admission that I had once and for all given up the use of the
spoken word. I mention this here because I have repeatedly noticed in myself this
desire to hold on to a vanishing function right up to the very last moment. Possibly
it is a common phenomenon in invalidity; in that case it ought to be taken into account
in getting invalids accustomed to the use of aids such as this typewriter (and speech
communication) machine.
However this may be, my resistance was finally overcome and for about two months I
have regularly been using the machine for conversations. The normal arrangement is
for my conversation partner to sit beside me and to read the display screen while I
write. I appreciate it if he completes a word or a sentence so that I need no longer
go on writing. I must not give the impression that conversing in this manner is without
problems: the writing pace is 55 to 60 strokes per minute, that is to say ten words a
minute, which is truly a snail's pace and requires a great deal of patience, particular-
ly from the conversation partner (for the invalid himself the patience factor is not such a problem, because he has already had long training in the exercise of patience).
Depending on the nature of the conversation, the "atmosphere" in which it took place
and the degree of mutual understanding, I have in this way been able to conduct many
conversations very satisfactorily. I have noticed that, although it is necessary to
be somewhat sparing in expression, it is not a good thing to resort to a kind of
telegram style, because in that way the conversation is not given a chance and gets
no further than the exchange of information. I have a real problem only with a con
versation partner who "does not listen", that is to say goes on speaking while I write.
I think there are various reasons for this. In the first place, there is an enormous
difference of speed between his share in the conversation and mine, which does not
help the homogeneity of the conversation. Furthermore, I am forced to divide my
attention because I have to write and listen at the same time. And finally, my re
marks when they finally appear on the screen have sometimes ceased to be relevant:
the partner is already so far ahead that he either misinterprets them or completely
fails to understand them.
In addition, the wish arose to be able to converse in other parts of the house or
outside it where the machine is not available. For this purpose I use a piece of
cardboard on which the panel is copied, and with my head lamp I form words in the
same way as with the machine, so that while my conversation partner follows the
panel he "hears" what I have to say. It is quite a useful alternative, but does not
compare with conversing on the machine.
Discussion
For the user in question the typewriter operated by the directing of a lamp fixed to
his brow has proved to be a unique and fundamental aid. Practically from the very
beginning he was able to use the apparatus very easily and to express himself with it
in writing at speeds of the order of 60 strokes a minute and with very few mistakes.
More especially during the initial period the facilities for making corrections proved
to be important, since they gave the user the chance to correct his mistakes himself before they could be noticed by others.
The apparatus provided some compensation for the lack of hand and arm control. When
at a later stage the patient gradually lost his power of speech, the typewriter
was also used for direct communication with a conversation partner. This kept the
user mentally active and motivated.
In what way can a lightspot-operated typewriter be brought within the reach of those
who need such an aid? On the basis of experience gained, preferably with several
prototypes and different users, it would be necessary to make modifications to the
design.
The important point is to make as much use as possible of commercially available
equipment. Components to be newly designed will have to be studied in the prototype
stage with a view to production costs, ruggedness, reliability and service. When the
prototype is ready for production it will have to be tested in practice by future users.
There will also be problems of finance, organisation and distribution to be solved,
and this will involve building up a large number of contacts (Bouma, Engel, Melotte,
1972; Melotte and Leopold, 1976).
In the present case some of these problems were recognised earlier, and some models
of an improved version of the prototype described in 1973 by Soede have already been
tried out in practice. Given the nature of the problems to be solved, it is evident
that a fairly long development process will be needed. However, since the prices of
the commercial equipment used here are steadily falling, and in view of the possibility of a coordinated effort, it should be possible within a few years to make a
head-movement-operated typewriter generally available for handicapped persons in
need of such an aid.
As regards other capabilities of the equipment, a facility should be considered
for making it possible to add or erase something at any point in the display. This
would enable the user to produce a difficult text, requiring frequent recasting before reaching its final form. Various calculations, such as addition and multiplication operations, might also be displayed and carried out on the monitor. Another
possibility would be to make provision for games such as chess, crossword puzzles
and even ludo to be played, after some practice, on the monitor.
Finally, there are questions concerning associated equipment or associated possibilities. In the first place, a photocell panel can also be used as an operating panel
for other aids. For example, patients unable to spell (children, for instance) would
be able to make wishes known by means of pictures, and acoustic signals might also
be transmitted, for example via previously stored speech signals. In this connection
we had an earlier idea of a limited set of "vital sounds" (one per key) which could
represent an extreme stage of communication. At a somewhat less extreme level, one
might consider remote control of various kinds of equipment.
On the panel described here one photocell operates a horn. This can reduce the
user's feeling of isolation when he is working by himself. It is a device that can
easily be expanded to meet the user's personal wishes. Examples are the signal for
operating the page-turning device, for automatically dialling a telephone number,
or for switching a television set, radio set or cassette recorder on and off.
Such facilities would give the handicapped a greater degree of self-reliance.
References
Bouma, H., Engel, F.L. and Melotte, H.E.M. (1972) Technological devices for the visually handicapped: Gap between research effort and available aids, I.P.O. Annual Progress Report, 7, p. 46-54.
Soede, M., Stassen, H.G., van Lunteren, A. and Luitse, W.J. (1973) A lightspot operated typewriter for severely physically handicapped patients, Ergonomics, 16, p. 829-844.
Melotte, H.E.M. and Leopold, F.F. (1976) Development of aids for the perceptually handicapped, I.P.O. Annual Progress Report, 11, p. 116-119.
The IPO relief-drawing set
H.E.M. Melotte
Based on earlier experiences, an improved relief-drawing set has been developed by the Institute for Perception Research. This design, using an injection moulding technique, is better suited for large-scale production, which was in fact started in 1977.
By writing with some pressure with an ordinary ball-point, the IPO relief-drawing set makes it possible to produce durable, embossed drawings that are immediately tangible at the written side of the special plastic drawing sets.
The major applications of the relief-drawing set are seen as:
- an educational aid for the visually handicapped, emphasising abstract functioning,
e.g. to be used for the learning of writing, drawing, mathematics, geography,
games, music, etc.,
- a means for exchanging written messages between people with normal vision and
people with visual impairments and
- a scratch pad for the elderly blind who find it difficult to learn Braille.
The new relief-drawing set can be obtained from:
The Dutch Association for the Blind (VNBW),
Kipstraat 54, Rotterdam, The Netherlands.
Instrumentation
Digital equipment: a number of examples
L.F. Willems, G. Moonen, C. Lammers, J. Dobek, A. van Nes and H. Jimenez Nichols*
Introduction
The character of electronic instrument design is in the process of drastic change
due to the breakthrough of digital MSI and LSI circuits in particular. It looks as
if analog circuitry is losing more and more ground: even the T.V. receiver and the
audio tape recorder are in the course of being digitalized. An important factor in
this process seems to be the design philosophy of a preference for digital solutions
to electronic problems.
The advantages of digital circuitry are often quoted and they are probably all valid.
But the question then arises: does analog circuitry still have a future in the design of instruments? We have no answer to this problem; we confess, however, that in our laboratory we, too, are biased towards a digital design philosophy.
In this contribution we describe four apparatuses based on digital methods that we could not have imagined some years ago. The devices to be discussed are:
- A digital modulator for accurate and stable modulation of two analog signals, to be used in psychoacoustic experiments.
- The varidac, a signal gate with a number of possible slope waveforms: linear slopes, cosine-shaped slopes or gaussian slopes.
- The digital audio loop, a digital memory for storing speech waveforms (with a maximum duration of 1.6 sec.) that provides means for flexible retrieval of the waveform.
- A four-formant speech synthesizer that uses MSI digital filters.
Accurate four-quadrant modulator
In psychoacoustic experiments, equipment has to meet very stringent requirements, one
of the weak links being modulators. Analog multipliers yield nonlinear distortion
which limits the performance of the experimental set-up. These difficulties can be
overcome by digitalization of the modulator. With off-the-shelf components anydesired accuracy of the modulator can then be obtained.
A block diagram of the modulator is shown in Fig. 1. The design is straightforward:
the input signals are sampled by a sample-and-hold switch and converted by a 12-bit
ADC to digital numbers; these numbers are digitally multiplied in an MSI circuit (TRW MPY-12AJ, the circuit being used quite inefficiently here). After multiplication, the result is rounded to a 12-bit number, which is then converted to an analog voltage by means of a 12-bit DAC followed by a sample-and-hold circuit.
The timing of the various units in this modulator is taken from a PROM memory and
an address counter. The highest possible sampling frequency is 60 kHz. There are no
filters incorporated in the modulator.
* Student P.I.I.
Fig. 1. Block diagram of the four-quadrant modulator.
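To make the signal path concrete, the following listing is a minimal present-day software sketch of the same chain: sample, quantise to 12 bits, multiply, round back to 12 bits, convert. It models the principle only; the function names, the fixed-point scaling and the rounding rule are our own assumptions, not details taken from the hardware.

import numpy as np

def quantize_12bit(x):
    # Model the 12-bit ADC: map a signal in [-1, 1) to integers -2048..2047.
    return np.clip(np.round(x * 2048), -2048, 2047).astype(np.int64)

def digital_modulator(a, b):
    # a, b: input signals in [-1, 1), already sampled (at most 60 kHz here).
    a_q = quantize_12bit(a)
    b_q = quantize_12bit(b)
    product = a_q * b_q                      # 12 x 12 bits -> 24-bit product
    rounded = (product + (1 << 10)) >> 11    # round back to 12 bits
    rounded = np.clip(rounded, -2048, 2047)
    return rounded / 2048.0                  # the DAC: back to an analog value

# Example: four-quadrant modulation of a 1 kHz tone by a 5 Hz sine.
fs = 60_000
t = np.arange(fs) / fs
out = digital_modulator(np.sin(2 * np.pi * 1000 * t), np.sin(2 * np.pi * 5 * t))

Because the multiplication itself is exact, the only distortion left in such a scheme is the quantisation of input and output, which is what makes the digital solution attractive here.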
The varidac
In psychoacoustic and speech perception experiments signal gates with a high on-off
ratio (at least 60 - 70 dB) are required. The most critical point in such a gate is
the signal leakage in the gate-closed condition. Analog circuits performing the
gating function therefore need accurate balancing to remove the influence of unequal components in the circuit and the change in these components due to aging and temperature. For the envelope signal generator the same demands exist: in the
gate-closed situation the dc voltage controlling the gate must stay within a few
millivolts of zero volt.
In the design of the varidac these problems were circumvented by using a multiplying
DAC and a digital envelope signal generator. A multiplying DAC is obtained by using
a DAC whose reference voltage (of both polarities) can be applied externally. In the
varidac this reference voltage input is used as an input for the analog audio signal.
The voltage is then multiplied by the digital number fed to the DAC in the usual way.
The envelope signal is generated by accumulating stored increments or decrements
from a PROM memory (1024 x 4 bits) in an ALU (Arithmetic Logic Unit). The clock
frequency of this process determines the duration of the slope of the envelope signal. This duration can be chosen independently for the rise and fall of the envelope.
The following waveforms can be chosen for the envelope signal:
- rectangular
- linear slopes
- cosine-shaped slopes
- gaussian-shaped slopes
- gate continuously open (for testing purposes).
A block diagram of the varidac is shown in Fig. 2. The clock frequency for the
timing process of the PROM is provided by an address counter. With the switches t1 and t3 the slope durations can be selected (2.5 ms to 40 ms), while with a switch on the main clock divider all timing functions can be slowed down by a factor of 10.
The linear waveform is obtained by connecting the output of the address counter
directly to the multiplying DAC.
Fig. 2. Block diagram of the varidac.
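In software the same scheme reads as follows: a table of increments (the PROM) is accumulated (the ALU) to form the envelope, which then multiplies the audio signal (the multiplying DAC). The listing is an illustrative sketch; the table length matches the text, but the particular increment values, the gaussian constant and the function names are our own. The rectangular and continuously-open modes are trivial and omitted.

import numpy as np

def envelope_rise(shape, n=1024):
    # The running sum of the stored increments traces the chosen slope shape.
    x = np.linspace(0.0, 1.0, n)
    if shape == "linear":
        curve = x
    elif shape == "cosine":
        curve = 0.5 * (1.0 - np.cos(np.pi * x))
    elif shape == "gaussian":
        curve = np.exp(-4.5 * (x - 1.0) ** 2)   # assumed gaussian-shaped rise
    else:
        raise ValueError(shape)
    increments = np.diff(curve, prepend=0.0)    # contents of the increment PROM
    env = np.cumsum(increments)                 # the ALU accumulating them
    return env / env[-1]                        # normalise: gate fully opens

def gate(signal, fs, rise_ms=2.5, fall_ms=40.0, shape="cosine"):
    # Gate a signal; rise and fall durations are selectable independently
    # (2.5 ms to 40 ms in the varidac, set by the clock frequency).
    # The signal is assumed longer than rise + fall.
    n_rise = int(fs * rise_ms / 1000.0)
    n_fall = int(fs * fall_ms / 1000.0)
    table = envelope_rise(shape)
    grid = np.linspace(0.0, 1.0, len(table))
    up = np.interp(np.linspace(0, 1, n_rise), grid, table)
    down = np.interp(np.linspace(0, 1, n_fall), grid, table)[::-1]
    env = np.concatenate([up, np.ones(len(signal) - n_rise - n_fall), down])
    return signal * env   # multiplying DAC: digital envelope times audio signal

Since the closed-gate condition corresponds to the digital number zero, the on-off ratio is limited only by DAC leakage, not by component balance.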
Digital audio loop
A basic drawback of a tape loop mounted on a tape recorder for providing repeating
speech fragments is that it cannot be synchronised with other equipment. Storing
the speech waveform in a digital memory provides, among other possibilities, immediate access to any part of the recorded speech and, to a large extent, obviates the synchronising problems.
In the design discussed here the speech waveform is quantised by means of a 10-bit
ADC and stored in a 16k x 10 bit memory. The resulting signal-to-quantisation-noise ratio also compares favourably with the signal-to-noise ratio of tape recorders.
Access to the speech waveform is obtained by addressing the digital memory and
routeing the speech samples to a DAC. A speech fragment of about 1.6 sec. can be
stored with a sampling frequency of 10 kHz.
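As a rough cross-check (a textbook estimate of our own, not a figure from the original design): uniform quantisation with N bits gives a signal-to-quantisation-noise ratio of about 6.02 N + 1.76 dB, i.e. roughly 62 dB for N = 10, which is indeed better than the 50-60 dB typical of analog tape; and 16384 samples read out at 10 kHz last 16384/10000, or about 1.6 sec, the duration quoted above.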
A block diagram of the apparatus is shown in Fig. 3. The memory addressing is performed
with the aid of an address counter so that the successive memory locations can be reached. This address counter can be preset at a certain position by means of a number of switches. There are two sets of these switches and a remote entry. At an externally supplied starting pulse the address counter is loaded from the corresponding set of switches and started at that position.
Fig. 3. Block diagram of the digital audio loop.
The output of the address counter is compared with the setting of a third set of
switches: a pulse is generated when the address counter reaches the value set by
these switches. This pulse can be used to stop the cycle, to restart the same cycle
or start another segment, or even to start an external device.
The digital audio loop can also be used as a variable delay for speech signals. To this end the signals are stored in memory and read out again after the delay time has elapsed.
There are no filters incorporated in the device, so that the sampling frequency can be chosen freely.
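In software terms the device is a presettable circular buffer. The sketch below models the 16k x 10-bit memory, the presettable address counter and the stop comparator; the class and method names are our own, and the quantisation step is simplified.

import numpy as np

class AudioLoop:
    # Software model of the digital audio loop: a 16k x 10-bit memory
    # addressed by a presettable counter with a stop comparator.

    def __init__(self, size=16384):
        self.memory = np.zeros(size, dtype=np.int16)

    def record(self, samples):
        # Quantise to 10 bits (-512..511) and store, as the ADC does.
        n = min(len(samples), len(self.memory))
        self.memory[:n] = np.clip(np.round(samples[:n] * 512), -512, 511)

    def play(self, start, stop):
        # Load the counter from a switch setting ('start') and run until the
        # comparator pulse at 'stop'; in the hardware that pulse can equally
        # restart the cycle, start another segment, or trigger a device.
        return self.memory[start:stop] / 512.0

    def delay(self, samples, delay_samples):
        # Variable delay: write the signal in, read it out again later
        # (delay_samples is assumed smaller than the signal length).
        self.record(samples)
        early = self.memory[:len(samples) - delay_samples] / 512.0
        return np.concatenate([np.zeros(delay_samples), early])

At a 10 kHz sampling rate, for instance, play(0, 16000) would replay a 1.6 sec fragment, and any segment boundary can serve as a synchronisation point for other equipment.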
A four-formant speech synthesizer
A digital speech synthesizer, based on four formant filters connected in series,
was built with the aid of MSI circuits, on each of which two digital filters are
realized (TMC 539).
The synthesizer has to produce speech from formant-coded speech data, with a sample rate of 8 kHz. The coded speech data can conveniently be obtained by using LPC methods (cf. Vogten and Willems, this issue). In the synthesizer these formant data are used as coefficients for the digital filters.
A block diagram of the synthesizer is shown in Fig. 4. The two digital filters in
one circuit block require the serial storage of 120 bits of coefficient data, hence each filter pair has an associated shift register 120 bits long.
Fig. 4. Block diagram of the speech synthesizer.
For unvoiced speech a maximum-length-sequence generator is used as a noise source, the sequence length being 2^15 - 1 samples. For voiced speech a sawtooth generator is used, which is simply an accumulating register to which the pitch frequency (F0) is added, the register being reset as soon as the accumulated sum exceeds the sampling frequency.
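Both excitation sources and the filter cascade translate directly into software. The listing below is a simplified model: the maximum-length sequence uses an assumed 15-bit feedback tap pair, and the resonator coefficient formula is a common textbook choice, not the TMC 539 realisation.

import numpy as np
from scipy.signal import lfilter

FS = 8000  # sample rate of the synthesizer

def mls_noise(n):
    # 15-bit linear feedback shift register; period 2**15 - 1 samples.
    reg, out = 1, np.empty(n)
    for i in range(n):
        bit = ((reg >> 14) ^ (reg >> 13)) & 1    # assumed taps 15 and 14
        reg = ((reg << 1) | bit) & 0x7FFF
        out[i] = 1.0 if bit else -1.0
    return out

def sawtooth(n, f0):
    # Accumulating register to which F0 is added every sample; reset as
    # soon as the accumulated sum exceeds the sampling frequency.
    acc, out = 0.0, np.empty(n)
    for i in range(n):
        acc += f0
        if acc >= FS:
            acc -= FS
        out[i] = 2.0 * acc / FS - 1.0
    return out

def formant(x, freq, bw):
    # One second-order digital resonator of the series cascade F1..F4.
    r = np.exp(-np.pi * bw / FS)
    a = [1.0, -2.0 * r * np.cos(2.0 * np.pi * freq / FS), r * r]
    return lfilter([sum(a)], a, x)               # unity gain at dc

# A steady vowel-like fragment: voiced source through four formants in series.
y = sawtooth(FS, f0=100.0)
for f, b in [(500, 50), (1500, 70), (2500, 110), (3500, 180)]:
    y = formant(y, f, b)

Running speech would of course update F0 and the formant coefficients frame by frame from the LPC analysis, which is exactly what the coefficient memory and shift registers provide for in the hardware.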
A controlled voice switch
U.O. Schroder
Introduction
In our visual experiments stimuli are presented to a subject and we want to know
something about the processing time required by the subject. From the experimenter's
point of view the most convenient way would be to measure the response latency by
means of a push-button. However, this allows anticipation of the stimulus and places an extra burden on the subject carrying out the task.
From the subject's point of view a voice switch is more convenient and there is no
anticipation, but a voice switch sometimes responds to extraneous sounds, like the closing of a door, breathing, or the opening of the subject's lips. In spite of these difficulties we chose the voice switch to measure our response latencies; in this contribution we describe how we check the correctness of the measured response latencies, and we report a number of test experiments on the performance of the voice switch.
Form of the voice switch
Version I
In a first version of a voice switch control we started a 1 kHz sound when the
stimulus was presented to the subject. At the moment the voice switch detected a voice onset, the 1 kHz sound stopped and a control lamp lit up. During the
session the 1 kHz sound was recorded on the left channel of the tape recorder,
the voice of the subject on the right channel. After the session, the correctness of the performance of the voice switch was checked by listening to the tape; incorrect response latencies were deleted from the data. It was noted that this control method didn't work too well, for, on listening to the tape once again, previously undetected errors of even 500 msec or more were found.
Version"
We changed our cheap dynamic microphone for an electret microphone and found that a
flat frequency response is very important for voice switch performance. The voice
switch control was also changed. Using a switch activated by the voice switch, the
microphone now switches to the other channel of the tape recorder on detection of a
voice response, and switches back on detection of the end of the response, that is
if there is no detected voice for 180 msec.
Therefore only the sounds that the voice switch reacts to are recorded on the right-hand channel, and when listening to the right-hand channel for incorrect response
latency measurements only the erroneous sound is audible.
As an example, let the stimulus be 'hat' and the response of the subject "hat or fat". On the left channel are audible: the pushing of the button, the
noise of changing the stimulus, and the first part of the onset of the letter "h"
followed after a short silence by "or fat". On the right channel there is audible
"hat",or "at" if the voice switch didn't respond to the "h" quickly enough. It is
therefore also possible to obtain information on the performance of the voice switch 137
on letters like "h". The first aim of our control was to facilitate improvement and
testing of the voice switch, so that now, using this voice switch plus control, we are able to exclude incorrect measurements (typically 75%) from the data in a very convenient way.
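The switching rule described above is simple enough to state as a program. The sketch below is our own reconstruction of the logic only: the frame length and the energy threshold are assumed values, and the actual detector is an analog circuit, not a computation.

import numpy as np

def route_channels(mic, fs, threshold=0.02, hangover_ms=180, frame_ms=10):
    # Route microphone samples to the right-hand channel while a voice
    # response is detected, and back to the left-hand channel once no
    # voice has been detected for 180 msec.
    frame = int(fs * frame_ms / 1000)
    hang = hangover_ms // frame_ms
    left, right = np.zeros_like(mic), np.zeros_like(mic)
    silent = hang                       # start in the 'no voice' state
    for i in range(0, len(mic) - frame + 1, frame):
        seg = mic[i:i + frame]
        if np.sqrt(np.mean(seg ** 2)) > threshold:
            silent = 0                  # voice detected: restart the hangover
        else:
            silent += 1
        if silent < hang:               # within 180 msec of detected voice
            right[i:i + frame] = seg
        else:
            left[i:i + frame] = seg
    return left, right

The 180 msec hangover is what keeps a whole response, pauses included, on one channel, so that only genuinely separate sounds end up as separate right-channel events.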
Testing the controlled voice switch
Three experiments have been done to get an idea of the errors introduced by using the voice switch.
During experiment I, randomly flashed lights were used and the subject had to respond each time directly after the flash, with a known word. This was repeated 40 times, then another word was chosen. For reference purposes the subjects were also asked to tap with a pencil instead of responding with a word.
Experiment II was almost the same as experiment I, except that the subject named alternately one of two words. In experiment III, words were presented foveally in a tachistoscope for as long as the subject needed to respond to the word (words were chosen at random, one out of five).
In the first two experiments the Dutch words were: lak, hak, zak and pak; in the third experiment: hak, zak, tak, tik and pak.
The averaged response latencies are shown in Fig. 1. There seems to be little difference between the voice switch response on the initial letters l, h and z, but the response latencies on the letter p are, surprisingly, about 15 msec delayed. This delay was expected on the response latencies of the letter h, because there is an audible difference between the onset of the "h" and the moment the voice switch responds.
The conclusion is that measured voice onset times are not uniquely related to the availability of the response of the subject. A clear dependence on the first letter of the word exists, up to say 60 msec for certain subjects. If the subject wishes to pronounce the letter 'p', more time is needed than for the letter 'h', because in pronouncing the letter 'p' pressure must be built up in the lungs, which is not necessary for a letter like 'h'. Therefore an audible improvement of the voice switch on the letter 'h' would result in even greater differences in response latencies between 'p' and 'h' while, ideally, there should not be any difference at all.
Fig. 1. Differences in the measured response latencies found in three experiments. Experimental points represent the average of five subjects. (Reference: tapping with a pencil.)
Behaviour of the latency distributions
Except for letters like 'p', there is no sharp onset of the voice; a gradual onset causes fluctuations in the measured response latencies, while for the letter 'h' the 'onset time' is 100 msec. It might therefore be expected that the voice switch introduces an extra uncertainty in the response latencies, resulting in an increased spread of the distribution.
Fig. 2a. The standard deviations (s) of the measured response times in experiment I, for each of the five subjects, were quite the same for all subjects and words. (The mean response latency was 250 ms.)
Fig. 2b. The standard deviations (s) of the response latencies in experiment III were greater than in experiment I. (The mean response latency was 320 ms.)
In experiments I and II the standard deviation was about 35 msec for all words and subjects (Fig. 2a); the standard deviations for spoken words were 5 msec higher than for the mechanical task. In experiment III the need for recognition of words introduced a much greater increase in the standard deviation (Fig. 2b) than in experiments I and II.
Our conclusion is therefore that the voice switch errors are negligible in visual
recognition experiments where the standard deviations are 100 msec or more.
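This conclusion can be checked with the usual rule for combining independent sources of spread: taking the switch jitter to be the 35 msec found above and assuming it independent of a task spread of 100 msec, the measured standard deviation becomes √(100² + 35²) ≈ 106 msec, an inflation of only about 6%.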
IPO publications
IPO publications 1977
P 309 H. Duifhuis
Cochlear Nonlinearity and Second Filter: Psychophysical Estimation of Model Parameters.
J. Acoust. Soc. Amer., 1976, 60, p. S39.
Abstract of a paper contributed to the 92nd meeting of the Acoustical Society of America, held at San Diego, November 1976.

P 310 H. Duifhuis and W.F. Simons
The Critical Band: its Relation to Cochlear Nonlinearity and Second Filter.
J. Acoust. Soc. Amer., 1976, 60, p. S39-S40.
Abstract of a paper contributed to the 92nd meeting of the Acoustical Society of America, held at San Diego, November 1976.

P 311 A.F.V. van Katwijk
Implicit Knowledge of Pitch Patterns in the Perception of Accented Syllables.
Löwen und Sprachtiger. Ed. Rudolf Kern, Leuven: Peeters 1976, p. 385-394.
It is not simple acoustical attributes which constitute the cues by which native speakers of Dutch pick out accented syllables in spoken language. It is rather pitch movements that are integrated parts of intonation patterns that elicit accent judgments consistently and non-trivially.

P 312 D.G. Bouwhuis
Recensie van Psycholinguistics door N.H. Markel (Ed.).
Massa Communicatie, 1976, p. 235-237.

P 313 F.L. Engel
Visual Conspicuity, Visual Search and Fixation Tendencies of the Eye.
Vision Res., 1977, 17, p. 95-108.
The cumulative probability of target discovery during search has been related experimentally to the relevant "conspicuity area", the visual field in which the target can be discovered after a single eye fixation. During search, "non-targets" were found to be fixated spontaneously in proportion to their conspicuity area. Further, small spontaneous eye fluctuations are described that occurred, during determination of the conspicuity areas, in the direction of the target discovered. Their occurrence and delay depended on the target eccentricity and the size of the conspicuity area. The results emphasize the relevance of the conspicuity area to research on visual selection.

P 314 Ch.P. Legein and H. Bouma
Dyslectic and Normally-Reading Children. I. Exploration of a Letter-Search Test for Screening Purposes. II. Follow-up and further Exploration in 4 Weak and 4 Normal Readers on Letter, Word and Number Recognition.
Documenta Ophthalmologica, 1977, p. 391-396.
Exploratory experiments compare dyslectics to normally reading children. In a letter search test, the dyslectics scored lower than the controls, but both groups gave lower scores for letters in words than for letters in unpronounceable strings. As to reading level, both groups read better than one year earlier, but only the dyslectics increased in parafoveal word recognition. Short-term memory scores for visually presented digits were worse in dyslectics, whereas for auditorily presented digits there was less difference. It is advocated to study component processes of reading not just in isolation but rather in their mutual dynamic dependencies.
P 315 L.P.A.S. van Noorden
Minimum Differences of Level and Frequency for Perceptual Fission of Tone Sequences ABAB.
J. Acoust. Soc. Amer., 1977, 61, p. 1041-1045.
Stream segregation or fission of the fast alternating tone sequence ABAB is known to occur if there is a sufficient frequency difference between the tones A and B. In this paper it will be shown that level difference instead of frequency difference can be sufficient to enable the occurrence of fission. The smallest level difference between A and B is ΔL ≈ 3 dB (2.5-10 tones per sec; tone duration 40 msec). At rates faster than 12 tones per sec a new perceptive phenomenon was observed: the roll effect. It is characterized by the weak tones being heard at double the tempo. The relation with the continuity effect is investigated using alternating sequences with both level and frequency difference between the tones as stimuli.

P 316 J. 't Hart
Vers une Base Psychophonétique de la Stylisation Intonative.
Actes des 8èmes Journées d'Étude sur la Parole. Aix-en-Provence: GALF 1977, Vol. 1, p. 167-173.
In numerous experiments it appeared possible to stylise fundamental-frequency curves surprisingly strongly without losing "perceptual equivalence" with the original intonations. Could the limits of the resolving power for rapidly changing frequencies explain this effect? Considerations of the difference thresholds for the size of an F0 movement, for its position in the syllable and for its slope, in connection with the glissando threshold, show that the differences between natural F0 curves and their stylisations, as commonly applied, remain largely below threshold.

P 317 R. Collier
La Perception de l'Intonation Anglaise par des Anglophones et Néerlandophones.
Actes des 8èmes Journées d'Étude sur la Parole. Aix-en-Provence: GALF 1977, Vol. 1, p. 139-146.
In the perception of intonation, i.e. of the variations of the fundamental of spoken language, it is possible to abstract from certain physical differences so as to arrive at a global classification of melodic contours into a limited number of basic intonations. This faculty of intonational abstraction was studied in English-speaking and Dutch-speaking subjects, who had to sort twenty English sentences according to their own criterion of melodic resemblance.

P 318 D.J.H. Admiraal, B.L. Cardozo, G. Domburg and J.J.M. Neelen
Annoyance Due to Modulation Noise and Drop-Outs in Magnetic Sound Recording.
Philips Technical Review, 1977, 37, p. 29-37.
It is possible to carry out many kinds of physical measurements with great accuracy on a product intended for human use and still not obtain a conclusive answer to the question of the product's usability. This is because human perception also enters into the picture. If the investigation is extended to include a representative number of human subjects it will be discovered, however, that human perception obeys certain laws. These can often be quantified, as has been done for example in the theory of the chromaticity diagram and in the international definitions of loudness. More particularly in the context of noise abatement, a further step has been taken and efforts have been made to express the concept of annoyance in numerical terms, leading to reproducible results. Something of the same sort is attempted in the article below, which deals with the annoyance caused to the listener by two imperfections of magnetic sound recording that are hard to avoid: modulation noise and the spontaneous occurrence of short interruptions or "drop-outs".

P 319 D.G. Bouwhuis
Recensie van: Leesbaarheid: onderscheiden, opnemen en verwerken door J.M. Dirken.
Massa Communicatie, 1977, p. 152.
P 320 H. Duifhuis
Cochlear Nonlinearity and Second Filter: a Psychophysical Evaluation.
Psychophysics and Physiology of Hearing. Eds. E.F. Evans and J.P. Wilson, London: Academic Press 1977, p. 153-163.
The class of models consisting of a linear first filter, a time-invariant nonlinearity, and a linear second filter appears to be reasonably successful in describing phenomena like sharpening, two-tone suppression and combination tone generation. We have analysed such a model (J. Acoust. Soc. Amer., 59, 408-423, 1976) and we were able to make, among other things, certain quantitative predictions about two-tone suppression. The response of the model to a probe + masker complex (as a function of masker level; probe level fixed; masker off CF and probe at CF) would contain separate quantitative information on first and second filter as well as on the nonlinearity. We measured such responses psychoacoustically using the pulsation threshold. The observed results show obvious similarity to the expected results. Amounts of suppression of over 30 dB have been measured. However, we also found certain systematic deviations between data and theory. These seem to indicate that at any rate the assumption that the first filter is linear is questionable. Thus, a confrontation of the predicted results with the present data at best gives a crude first-order approximation of the parameters to be estimated.

P 321 B. Leshowitz and R. Lindstrom
Measurement of Nonlinearities in Listeners with Sensorineural Hearing Loss.
Psychophysics and Physiology of Hearing. Eds. E.F. Evans and J.P. Wilson, London: Academic Press 1977, p. 283-293.
Nonlinearities characterising the auditory system with cochlear pathology are examined in several investigations of the psychoacoustical tuning curve. In regions of threshold elevation, the curve is extremely broad due to the disappearance of the finely-tuned segment and displays a "notch" in the vicinity of the probe corresponding to the greatly diminished effectiveness of the masker in this region. For observers with abrupt high-frequency loss, combination tones are not evident for placement of masker and probe in regions of normal hearing. A concomitant of this abnormal frequency response is a marked reduction of neural suppression revealed in forward-masking experiments with bandlimited noise. The data are consistent with a physiological vulnerability of the second filter.

P 322 H. Bouma
Visuele Maskering in het Leesproces.
Z.W.O. Jaarboek 1976, p. 112-119.

P 323 P.A. Vroon, H. Timmers and S. Tempelaars
On the Hemispheric Representation of Time.
Attention and Performance, vol. 6. Ed. S. Dornic, Hillsdale: Erlbaum 1977, p. 231-245.
It is well known that subjective duration is related to cognitive processes. Also, differences between the information processing and analysing systems of both cerebral hemispheres have been reported. This study attempts to determine whether or not the left brain is superior in the encoding of time. In Experiment I the subjects reconstructed the durations of simple reaction time tasks carried out predominantly by either the left or the right brain half in the visual and auditory modality. It appeared that the variances of the time estimates of the right brain considerably exceeded those of the left. Consequently, there is a relatively great time uncertainty.
P 324 H. Timmers and W.A. Wagenaar
Inverse Statistics and Misperception of Exponential Growth.
Perception and Psychophysics, 1977, 21, p. 558-562.
Exponential growth presented by numerical series or graphs is grossly underestimated by human subjects. This misperception was considerably lessened by presenting decreasing functions; this conclusion holds for both numeric and graphic stimuli. In the numerical conditions about 25% of the subjects performed according to the statistical norm. In contrast with previous results, considerable individual differences with respect to sensitivity for rate of growth were observed. This finding was interpreted in terms of task difficulty: extrapolation of ascending series is too difficult a task to be discriminative. Extrapolation of descending series is much easier, and may therefore better discriminate among subjects.

P 325 H.C. Bunt
The Formal Semantics of Mass Terms.
Papers from the third Scandinavian Conference of Linguistics, held at Hanasaari, October 1-3, 1976. Ed. F. Karlsson, Turku: Academy of Finland 1976, p. 81-94.
The paper discusses the basic conceptual and formal problems connected with the semantic interpretation of mass terms. A conceptual analysis is presented which departs from the classical analysis due to Quine. The conceptual analysis which is put forward instead is formalised in terms of a newly developed extension of set theory, called ensemble theory. An axiomatic formulation of ensemble theory and a survey of its most important theorems are given in an appendix. It is shown that by analysing mass terms in the way proposed here many of the notorious semantic problems connected with them can be solved. The best developed alternative approach, due to Parsons, is shown to be fundamentally inadequate and inconsistent.
Papers accepted for publication
MS 286 L.L.M. Vogten
Simultaneous Pure-Tone Masking: the Dependence of Masking Asymmetries on Intensity.
To appear in: J. Acoust. Soc. Amer.
Phase locking between probe and masker was used in a series of pure-tone masking experiments. The masker was a stationary sine wave of variable frequency; the probe a fixed-frequency tone burst. We have observed that for small frequency separations the masking behaves asymmetrically around the probe frequency. This asymmetry depends on intensity. For a 1 kHz probe at low stimulus levels there is a maximum masking effect at about 60 Hz above the probe frequency, whereas at high levels maximum masking is produced at a frequency definitely below the probe frequency. These results are discussed in relation to current neurophysiological and psychophysical data. For the high-level asymmetry possible interpretations are suggested in terms of two changes in the excitation pattern of the basilar membrane: (a) a shift of the top and/or (b) a slope asymmetry, both increasing with level. The low-level asymmetry will be treated in a second paper.
MS 294 L.L.M. Vogten
Low-level Pure-Tone Masking: A Comparison of "Tuning Curves" obtained with Simultaneous and Forward Masking.
To appear in: J. Acoust. Soc. Amer.
Simultaneous and forward pure-tone masking are compared, using a fixed-level probe of 20 msec and a 200 msec masker. For a 1 kHz probe of 30 dB SPL the required masker level, Lm, is measured as a function of the time interval Δt between masker offset and probe onset. When masker and probe have equal frequencies a monotonic relationship is found for phase π but not for phase 0. When the masker frequency, fm, is 50 or 100 Hz below the probe frequency, fp, a nonmonotony is found, with a minimum at Δt = 0, the transition between simultaneous and forward masking. When fm is 50 or 100 Hz above fp, however, the relationship of Lm to Δt is monotonic. In the case of simultaneous masking the iso-Lm curves, which give Lm as a function of fm, show a typical asymmetry around fm = fp, leading to the positive shift of the maximum masking frequency, MMF, previously reported for stationary pure-tone maskers. In the case of forward masking, however, this asymmetry ceases to exist. We conclude that simultaneity of probe and masker is a necessary condition for the occurrence of a low-level positive MMF shift.
The results are discussed in the light of psychoacoustical and neurophysiological data on two-tone suppression. A possible interpretation of the nonmonotony and of the positive MMF shift is suggested in terms of the physiological asymmetry in two-tone suppression.

MS 300 S.G. Nooteboom, J.P.L. Brokx and J.J. de Rooij
Contributions of Prosody to Speech Perception.
To appear in: Studies in Language Perception; Proceedings of the Symposium on Language Perception, held at Paris, July 18-25, 1976.
In this paper empirical data relating contributions of prosody to recognition in speech perception are discussed. A first type of data is concerned with the perception of prosodic continuity of the attended voice. A sequence of speech sounds may or may not be heard as a continuous stream of speech produced by a single voice. It is shown that both continuity in periodicity pitch and continuity in spectral composition contribute to perceived prosodic continuity. A second type of data has to do with a contribution of speech rate in the immediate environment of a test segment to the phonemic perception of this segment as a short or a long phoneme. The data can be explained by assuming the existence of backward perceptual normalisation of segment duration to the temporal structure of auditory information coming later in time. A third type of evidence is related to the contributions of prosody to the perception of specific linguistic information. Speech prosody potentially carries information on lexical, syntactic and semantic aspects of the message. A number of investigations, both in our own institute and others reported in the literature, show that listeners may actually use this information when they need to.
MS 310 H. Bunt
Ensembles and the Formal Semantic Properties of Mass Terms.
To appear in: Mass Terms: Some Philosophical Problems. F.J. Pelletier (Ed.), Dordrecht: Reidel 1978.
This paper presents an analysis of the formal, i.e. non-lexical, semantic properties of mass terms, which forms the theoretical basis for the handling of mass noun expressions in the PHLIQA 1 question answering system. The paper starts out with a discussion of what distinguishes mass nouns from count nouns, from a syntactic as well as from a semantic point of view. Existing proposals to define this distinction are critically examined. As the defining characteristic of mass nouns is proposed the property of homogeneous reference: a mass noun refers in such a way that no particular articulation of the referent into parts is presupposed, nor the existence of minimal parts. To the notion of homogeneous reference a precise meaning is given by formalizing it in terms of a mathematical formalism, developed especially for the study of mass terms, called ensemble theory. Ensemble theory deals with mathematical objects, called ensembles, that are characterized by their parts (sets are special cases of ensembles). The paper contains an informal introduction to ensemble theory and a listing of the axioms on which the theory is founded. Mass adjectives are defined as those adjectives that denote a homogeneous property, a notion which is also made precise in terms of ensemble theory. Based on the notion of an ensemble, a formal language for the semantic representation of expressions containing mass terms is defined. A simple theory of amounts, containing just the minimal ingredients needed for dealing with quantified mass noun expressions, is also incorporated in this language. Using the logical properties of ensembles and amounts, it is shown that the proposed semantic representations of expressions containing mass terms do account for the formal semantic properties of such expressions.
MS 312 S.G. Nooteboom
Perceptual Adjustment to Speech Rate: a Case of Backward Perceptual Normalization.
To appear in: Album Hendrik Mol. Amsterdam: Institute of Phonetics.
The effect of speech rate on the phonemic perception of Dutch /a/ and /a:/ as a function of vowel duration is studied. The test segment was the vowel of the Dutch word taak (task), which, when shortened, is perceived as tak (branch). The word was embedded in a sentence. The speech rates of the preceding part and of the following part of this sentence were varied independently. It was found that an increase in speech rate of the following part of the sentence consistently leads to a decrease of the phoneme boundary in milliseconds. A decrease of speech rate had no such effect. In no case did the speech rate of the preceding part of the sentence have any systematic effect on the phoneme boundary. The data are explained in terms of the relative importance of the most recent auditory information to perceptual normalization, and of an increase in the relative importance of spectral cues as compared to acoustic duration in slowed-down speech.

MS 314 H. Bouma and F.L. van Nes
De Leesbaarheid van Lijnsegment Cijfers op Displays.
To appear in: Nederlands Tijdschrift voor de Psychologie en haar Grensgebieden.
Technical and economical developments have led to a rapidly increasing use of new media for text presentation. Little is known about the legibility of the letter and number shapes which are used for such electronic media. This paper deals with research on the recognizability of numbers built up from straight line segments, which therefore have a schematic form. Erroneous recognition of such numbers leads to confusion between them. The distinctive function of the individual line segments has been determined from the errors. This analysis leads to an improved design of the number shapes. Firstly, the improvements aim at increasing the discriminability of the numbers; secondly, improvement of their acceptability, i.e. resemblance to the usual number shapes, plays a role.
MS 317 J. 't Hart
Looking for Rhythmical Structures Evoked by Isochronous Syllable Strings.
To appear in: Album Hendrik Mol. Amsterdam: Institute of Phonetics.
In 1975, we did an experiment on the relation between the location of the non-final fall and syntactic structure. Stimuli were strings of "hummed syllables" with pitch contours in which the location of the non-final fall was varied. Subjects responded by making sentences to which the given pitch contours would provide suitable fits. On intuitive grounds, it was later supposed that the response material contained systematic phenomena with respect to rhythmical organisation. In particular, the way in which rather long stretches of syllables between pitch accents seemed to be subdivided suggested a relation to the total number of syllables, the location of pitch accents and of the non-final fall. An attempt has been made to find formal criteria according to which subdivisions can be assigned on the basis of the language material alone. Fair agreement with the intuitive approach can be obtained when using pitch accents, lexical stresses and the alternation of full and neutral vowels to score on a 'light-heavy' scale. The subdivisions, now based on these scores, show essentially the same relations as mentioned above. It is concluded that intonation and the total number of syllables of the stimulus material have elicited particular rhythmical organisations of the response sentences and that these organisations are reflected in the language material chosen by the subjects. Meanwhile, their systematic trends seem to imply that the formal criteria applied to recover the rhythmic organisation correspond to some psychological reality.
MS 320 S.G. Nooteboom
Speaking and Unspeaking: Detection and Correction of Phonological and Lexical Errors in Spontaneous Speech.
Paper submitted to the Working Group on "Slips of the Tongue and Ear"; 12th International Congress of Linguists, August 31 - September 2, 1977.
An analysis of corrections of phonological and lexical speech errors in Meringer's corpus shows that: (1) most speech errors are corrected, phonological errors slightly more often than lexical ones; (2) stops for a new start are predominantly made at the first word boundary after the error, later stops being more frequent for lexical than for phonological errors; (3) in phonological errors new starts practically always go back to the last word boundary preceding the error, in lexical errors often further. To explain these data a mental strategy is hypothesized which checks the
output speech for phonological orthodoxy of word forms, and the syntactic and semantic appropriateness of short phrases.

MS 321 Ch.P. Legein and H. Bouma
Leesprocessen bij Leeszwakke Kinderen.
To appear in: Proeven op de som, Psychonomie in het Dagelijks Leven. Janssen, Vroon and Wagenaar (Eds.), Deventer: Van Loghum Slaterus.
When reading is investigated in children with reading difficulties, it appears not only that they react strikingly more slowly than normally reading children, but also that they have a smaller visual reading field. These findings might be of use in developing reading training programmes.
MS 322 H.E.M. Melotte and F.L. Engel
De IPO Relief-Tekenmap; van Idee tot Hulpmiddel.
To appear in: Maandblad voor Revalidatie.
A short description of the so-called relief-drawing set: a communication aid developed at IPO with which blind and partially sighted people can draw and write in permanently tangible and visible relief.
Reprints and preprints of IPO publications
Requests for reprints or preprints of the publications listed on pages 141-147 should be addressed to:
Library
Institute for Perception Research
P.O. Box 513, 5612 AZ Eindhoven
The Netherlands