
Computational auditory scene analysis: listening to several things at once Martin Cooke, Guy J. Brown, Malcolm Crawford and Phil Green

The problem of distinguishing particular sounds, such as conversation, against a background of irrelevant noise is a matter of common experience. Psychologists have studied it for some 40 years, but it is only comparatively recently that computer modelling of the phenomenon has been attempted. This article reviews progress made, possible practical applications, and prospects for the future.

In most listening situations, a mixture of sounds reaches our ears. For example, at a crowded party there are many competing voices and other interfering noises, such as music. Similarly, the sound of an orchestra consists of a number of melodic lines that are played by a variety of instruments. Nonetheless, we are able to attend to a particular voice or a particular instrument in these situations. How does the ear achieve this apparently effortless segregation of concurrent sounds?

E.C. Cherry [1] noted this phenomenon in 1953, and called it the ‘cocktail party problem’. Since then, the perceptual segregation of sound has been the subject of extensive psychological research. Recently, a thorough account of this work has been presented by A.S. Bregman [2]. He contends that the mixture of sounds reaching the ears is subjected to a two-stage auditory scene analysis (ASA).

Martin Cooke, B.Sc., Ph.D.

Studied Computer Science and Mathematics at Manchester University, and received a doctorate in Computer Science from the University of Sheffield. He is currently a lecturer in computer science there. He has been active in speech and hearing research since 1962, and his research interests include speech segregation, speech coding and developmental speech synthesis.

Guy J. Brown, B.Sc., Ph.D.

He is a graduate of Sheffield Hallam University, and in 1992 obtained a doctorate in Computer Science from the University of Sheffield where he is now a lecturer in computer science. He has studied computational models of hearing since 1969, and also has research interests in music perception and virtual reality.

Malcolm Crawford, B.Sc.

Graduated in psychology at the University of Sheffield. Currently a Research Associate working on object-oriented blackboard architectures for auditory scene analysis.

Phil Green, B.Sc., Ph.D.

Graduated in Cybernetics and Instrument Physics from the University of Reading in 1967, and obtained a doctorate from the University of Keele in 1971. He is the founder of the speech and hearing research group at the University of Sheffield, where he is currently a Senior Lecturer in the Department of Computer Science. His research interests include the combination of symbolic, statistical, and connectionist models in automatic speech recognition.

Endeavour, New Series, Volume 17, No. 4, 1993. 0160-9327/93 $6.00 + 0.00. © 1993 Pergamon Press Ltd. Printed in Great Britain.


In the first stage of this analysis, the acoustic signal is decomposed into a number of ‘sensory components’. Subsequently, components which are likely to have arisen from the same environmental event are recombined into perceptual structures that can be interpreted by higher-level processes.

Although ASA is documented comprehensively in the literature, there have been few attempts to investigate the phenomenon with a computer model. In this article, we describe progress on a model of auditory processing which is able to simulate some aspects of ASA. The model characterises an acoustic signal as a collection of time-frequency components, which we call synchrony strands [3], and then searches the auditory scene in order to identify components with common properties.

While our modelling studies have their own intrinsic scientific merits, they are also motivated by a number of possible applications. Firstly, the performance of automatic speech recognition (ASR) systems is poor in the presence of background noise. In contrast, human listeners with normal hearing are quite capable of following a conversation in a noisy environment. This suggests that models of auditory processing could provide a robust front-end for ASR systems.

A related point is that human listeners with abnormal hearing have difficulty in understanding speech in noisy environments. These listeners generally have neural defects of the cochlea, and are not helped by conventional hearing aids which simply amplify the speech and background noise together. A better solution would be to provide an ‘intelligent’ hearing aid able to attenuate noises, echoes, and the sounds of competing talkers, while amplifying a target voice. A model of ASA could form the basis for such a hearing aid.

Other applications of this work lie in the field of music processing. An example is provided by the transcription of recorded polyphonic music, for which it is necessary to identify how many notes are being played at a particular time, and to which instruments they belong. A model of ASA could provide the basis for an automatic transcription system by performing this segregation. Such a system could be faster and more accurate than manual techniques, and would provide an efficient means of transcribing recorded music which is not notated (e.g. much of folk, ethnic, and popular music). Also, a transcription system would provide feedback in music teaching, allowing a player to compare the transcription of his performance with the original score. Some early work on this is reported in G.J. Brown and M.P. Cooke [4].

Auditory scene analysis

In his book, Bregman [2] makes a distinction between two types of perceptual grouping: namely primitive grouping and schema-driven grouping. Primitive grouping is driven by the incoming acoustic data, and is probably innate. In contrast, schema-driven grouping employs the knowledge of familiar patterns and concepts that have been acquired through experience of acoustic environments.

Many primitive grouping principles can be described by the Gestalt principles of perceptual organisation. The Gestalt psychologists (e.g. K. Koffka [5]) proposed a number of rules governing the manner in which the brain forms mental patterns from elements of its sensory input. Although these principles were first described in relation to vision, they are equally applicable to audition. A potent Gestalt principle is common fate, which states that elements changing in the same way at the same time probably belong together. There is good evidence that the auditory system exploits common fate by grouping acoustic components that exhibit changes in amplitude at the same time. Similarly, grouping by harmonicity can be phrased in terms of the Gestalt principle of common fate. When a person speaks, vibration of the vocal cords generates energy at the fundamental frequency of vibration and also at integer multiples (harmonics) of this frequency. Hence, the components of a single voice can be grouped by identifying acoustic components that have a common spacing in frequency.
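To make the idea concrete, the sketch below (illustrative Python, not part of any system described in this article) applies a simple harmonic sieve: each candidate fundamental is scored by how many measured component frequencies fall close to one of its integer multiples, and the best-scoring fundamental defines the group. The candidate range and tolerance are arbitrary choices.

```python
import numpy as np

def harmonic_group(freqs, f0_candidates=np.arange(80.0, 400.0, 1.0), tol=0.02):
    """Return the candidate fundamental that accounts for the most components,
    together with the indices of the components it explains."""
    freqs = np.asarray(freqs, dtype=float)
    best_f0, best_members = None, []
    for f0 in f0_candidates:
        n = np.maximum(1, np.round(freqs / f0))   # nearest harmonic number
        err = np.abs(freqs - n * f0) / f0         # deviation relative to the spacing
        members = list(np.where(err < tol)[0])
        if len(members) > len(best_members):
            best_f0, best_members = f0, members
    return best_f0, best_members

# Harmonics of a 110 Hz fundamental mixed with two unrelated components.
components = [110, 220, 330, 440, 550, 270, 1234]
f0, members = harmonic_group(components)
print(f0, [components[i] for i in members])       # 110.0 and the first five components
```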

We now describe the processing carried out by the peripheral auditory system, and a computer model which simulates some aspects of auditory scene analysis.

Auditory representations

The peripheral auditory system - consisting of the outer, middle, and inner ear - serves to transform acoustic energy into a neural code in the auditory nerve. Rather than describe the detailed mechanisms of the periphery, some of the observed properties of this neural code will be reviewed here. J.O. Pickles [6] provides a good introduction to the structure and function of peripheral auditory physiology.

Individual auditory nerve fibres are frequency-tuned; that is, each fibre will respond maximally (at its highest firing rate) when presented with a tone of a specific frequency, with a fairly rapid fall-off of response as the frequency of the tone moves away from this ‘best frequency’. To a first approximation, then, the periphery is often considered to act as a bank of overlapping bandpass filters. Fibres also exhibit a high rate of firing at the onset of a tone. This ‘onset response’ dies away rapidly during sustained stimulation, and, at the offset, the rate drops below the ‘spontaneous rate’ of the fibre (that is, the rate at which the fibre fires in the absence of stimulation). One way to characterise the auditory code is in terms of the average firing rate across the range of frequencies represented by the fibres. For uniform random noise stimulation, this response is not flat - the outer and middle ears serve to boost frequencies in the mid-range (1-7 kHz).

Recent studies suggest that a rate characterisation is not sufficient to explain the perception of everyday sound sources, since the rate response of each fibre saturates quite quickly. For speech perceived in the ‘cocktail party’ situation, most auditory nerve fibres could reach saturation, so the profile of their response across frequency would appear to allow very little discrimination of different speech sounds. An alternative characterisation of fibre responses is provided by examining their fine time structure.

Auditory nerve fibres show a tendency to fire in phase with individual frequency components present in the stimulus. This ‘phase-locked’ response could be used to provide information about which frequency components are dominant even at high sound levels where the rate response has saturated.
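The principle can be illustrated with a toy simulation (ours, and not intended as a quantitative model of any fibre): if spikes occur preferentially at one phase of a 500 Hz tone, the most common inter-spike interval recovers the stimulus period even though the overall firing rate stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
f_stim = 500.0                                   # stimulus frequency (Hz)
period = 1.0 / f_stim

# Toy phase-locked spike train: at most one spike per cycle, near a fixed phase,
# with some timing jitter and roughly 30 per cent of cycles skipped.
cycles = np.arange(2000)
fired = rng.random(cycles.size) < 0.7
spike_times = cycles[fired] * period + rng.normal(0.0, 0.1e-3, fired.sum())

# First-order inter-spike interval histogram (0.1 ms bins up to 10 ms).
intervals = np.diff(np.sort(spike_times))
counts, edges = np.histogram(intervals, bins=np.arange(0.0, 10e-3, 0.1e-3))
mode_interval = edges[np.argmax(counts)] + 0.05e-3
print(f"dominant interval ~ {mode_interval * 1e3:.2f} ms, "
      f"implied frequency ~ {1.0 / mode_interval:.0f} Hz")   # close to 500 Hz
```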

It should be noted that this characterisation remains speculative. The detailed response of the periphery is still debated, and passive, linear models are giving way to active, nonlinear simulations. Furthermore, while most fibres saturate at fairly low intensities, smaller numbers of fibres characterised by a very low spontaneous firing rate saturate at much higher levels. It is too early to state with certainty exactly how our auditory abilities result from the transformations carried out in the periphery.

Beyond the periphery, this neural code is further elaborated at various loci en route to the auditory cortex. Subsequent stages of processing are probably aimed at enhancing various sound properties, such as frequency transitions and amplitude-modulation rates of components. These appear to be represented in ‘maps’ - arrays of cells which are arranged in two or more dimensions, with frequency and another parameter represented on orthogonal axes. The value of the parameter at a particular frequency is indicated by the firing rate of the cell at the appropriate position in the neural array. Some of our work is directed at modelling auditory maps, and indeed we have produced a separation system which uses these representations (Brown [7]). An alternative approach taken in our group is to speculate about the sort of information which the auditory system might need to solve the problem of auditory scene analysis. The following section describes a model based on forming data abstractions which make explicit aspects of the neural code.
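Loosely, such a map can be pictured as a two-dimensional array indexed by channel centre frequency and modulation rate, with each cell holding the strength of that modulation rate in the channel’s envelope. The sketch below is only a caricature of this idea (the filterbank is replaced by a crude heterodyne-and-smooth envelope detector, and all parameter values are invented); it is not the map model of Brown [7].

```python
import numpy as np

fs = 16000
t = np.arange(0, 0.5, 1.0 / fs)
# Toy input: a 1 kHz carrier, amplitude-modulated at 8 Hz.
signal = (1 + 0.8 * np.sin(2 * np.pi * 8 * t)) * np.sin(2 * np.pi * 1000 * t)

centre_freqs = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0])
am_rates = np.arange(2.0, 33.0, 2.0)                   # candidate AM rates (Hz)
am_map = np.zeros((centre_freqs.size, am_rates.size))  # the 'map': frequency x AM rate

for i, fc in enumerate(centre_freqs):
    # Crude channel envelope: heterodyne to baseband and smooth over 10 ms.
    baseband = signal * np.exp(-2j * np.pi * fc * t)
    kernel = np.ones(160) / 160.0
    env = np.abs(np.convolve(baseband, kernel, mode='same'))
    env -= env.mean()
    for j, rate in enumerate(am_rates):
        # 'Cell' value: strength of this modulation rate in the channel envelope.
        probe = np.exp(-2j * np.pi * rate * t)
        am_map[i, j] = np.abs(np.dot(env, probe)) / env.size

ch, r = np.unravel_index(np.argmax(am_map), am_map.shape)
print(f"strongest cell: {centre_freqs[ch]:.0f} Hz channel, {am_rates[r]:.0f} Hz AM rate")
```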

A computational model of auditory scene analysis

Our approach follows from D. Marr’s computational theory of vision [8], which argues for a series of rich representations, each of which makes some implicit organisation in the previous level of representation explicit. Of course, such speculative representation-forming should be guided both by auditory limitations and by the results of psychoacoustic experiments. One such representation, which we have called ‘synchrony strands’, makes explicit the temporal evolution of synchronised responses in modelled nerve fibres. This representation then allows a direct application of several of the Gestalt grouping principles mentioned above.

Figure 1 depicts the stages in synchrony strand formation. Initially, the signal is processed by a model of the auditory periphery. This consists of a gammatone filterbank (R.D. Patterson et al. [9]), with 250 filters spaced equally along an auditory scale to cover the range 50-5000 Hz. Next, the instantaneous frequency and envelope at the output of each filter are computed. The envelope forms the input to a model of the inner hair cell [10]. Its output reflects an averaged firing rate in the auditory-nerve fibre whose best frequency corresponds to the centre frequency of the processing channel. The instantaneous frequency is median smoothed within a window of 10 ms and downsampled to 1 ms estimates. The frame of estimates across all channels is then processed to determine contiguous ranges of channels (place-groups) which appear to be responding to the same stimulus component. For each set of channels grouped in this way, an estimate of dominant frequency and overall firing rate is computed. Finally, place-groups are tracked across time using a weighted linear approximation to the track, with weights chosen such that frames nearest to the ‘aggregation-boundary’ have most effect. This results in a collection of synchrony strands.
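A greatly simplified sketch of this pipeline is given below. It keeps the overall sequence of steps - gammatone filtering, per-channel envelope and instantaneous-frequency estimates, median smoothing, and the grouping of contiguous channels into place-groups - but omits the inner hair cell model, the 1 ms downsampling and the cross-time tracking into strands, uses far fewer channels, and substitutes a Hilbert-transform estimate of instantaneous frequency. It should be read as an illustration of the idea rather than as the implementation of [3].

```python
import numpy as np
from scipy.signal import hilbert, medfilt

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg & Moore) in Hz."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone(signal, fs, fc, duration=0.025, order=4):
    """FIR approximation of a gammatone filter centred on fc."""
    t = np.arange(0, duration, 1.0 / fs)
    envelope = t ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb(fc) * t)
    ir = envelope * np.cos(2 * np.pi * fc * t)
    ir /= np.sqrt(np.sum(ir ** 2))              # crude gain normalisation
    return np.convolve(signal, ir, mode='same')

def channel_features(signal, fs, fc, smooth_ms=10):
    """Envelope and median-smoothed instantaneous frequency for one channel."""
    analytic = hilbert(gammatone(signal, fs, fc))
    envelope = np.abs(analytic)[1:]
    inst_freq = np.diff(np.unwrap(np.angle(analytic))) * fs / (2 * np.pi)
    kernel = int(fs * smooth_ms / 1000) | 1     # odd kernel length, ~10 ms
    return envelope, medfilt(inst_freq, kernel_size=kernel)

def place_groups(inst_freq, amp, tol=0.05):
    """Split the channel array into contiguous runs whose instantaneous
    frequencies agree (to within tol, relative); return the dominant
    frequency (amplitude-weighted) and summed amplitude of each run."""
    groups, start = [], 0
    for ch in range(1, len(inst_freq) + 1):
        boundary = (ch == len(inst_freq) or
                    abs(inst_freq[ch] - inst_freq[ch - 1]) > tol * abs(inst_freq[ch - 1]))
        if boundary:
            f = np.average(inst_freq[start:ch], weights=amp[start:ch] + 1e-12)
            groups.append((float(f), float(np.sum(amp[start:ch]))))
            start = ch
    return groups

# Demo: a mixture of two tones, analysed at a single time frame.
fs = 16000
t = np.arange(0, 0.2, 1.0 / fs)
mix = np.sin(2 * np.pi * 440 * t) + 0.7 * np.sin(2 * np.pi * 1800 * t)
centre_freqs = np.geomspace(100, 4000, 32)      # the article uses 250 channels, 50-5000 Hz
frame = len(t) // 2
inst = np.empty(len(centre_freqs))
amp = np.empty(len(centre_freqs))
for i, fc in enumerate(centre_freqs):
    envelope, ifreq = channel_features(mix, fs, fc)
    inst[i], amp[i] = ifreq[frame], envelope[frame]
for f, a in place_groups(inst, amp):
    print(f"place-group near {f:7.1f} Hz, summed amplitude {a:.2f}")
```

With this toy input the place-groups found at the chosen frame cluster around the two tone frequencies; in the full model, the groups found at successive 1 ms frames are then linked over time into strands by the weighted tracking described above.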

Figure 2 shows a collection of synchrony strands for a synthetic syllable ‘ru’ used in the work of C.J. Darwin [11]. Each line represents a single strand, with line width denoting instantaneous amplitude at that location in time and frequency. Strands below 1 kHz represent harmonics of the fundamental frequency (110 Hz in this example), while those above 1 kHz denote ‘formant’ frequencies for this syllable. Formants are manifestations of the main acoustic resonances in the vocal tract, and the location of the two or three lowest frequency formants is thought to be the main determinant of vowel identity, for instance.

Figure 1 Stages in synchrony strand production: the signal is passed through the filterbank, instantaneous frequencies are estimated in each channel, and place-groups are aggregated over time.


Figure 2 Synchrony strands for the synthetic syllable ‘ru’.

What is noteworthy is that the amplitude modulation present on those strands above 1 kHz is correlated. Along with harmonicity, this is a potential cue for grouping these components together.
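One simple way to quantify this cue (our illustration; the use of a plain correlation coefficient, and the toy strands below, are simplifications) is to correlate the amplitude tracks of two strands over the interval where they overlap:

```python
import numpy as np

def am_correlation(amp_a, amp_b):
    """Pearson correlation between the amplitude envelopes of two strands,
    computed over their region of temporal overlap (here assumed aligned)."""
    a = amp_a - np.mean(amp_a)
    b = amp_b - np.mean(amp_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Toy amplitude tracks: two share an 8 Hz modulation pattern, the third does not.
frames = np.arange(0, 1.0, 0.001)                    # 1 ms frames
common_am = 1 + 0.5 * np.sin(2 * np.pi * 8 * frames)
strand1 = 1.0 * common_am
strand2 = 0.6 * common_am
strand3 = 1 + 0.5 * np.sin(2 * np.pi * 3 * frames)   # a different AM rate

print(am_correlation(strand1, strand2))   # close to 1: candidates for grouping
print(am_correlation(strand1, strand3))   # near zero: no common fate
```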

A blackboard architecture for computational auditory scene analysis

An important consequence of forming a compact description of the auditory code in the form of synchrony strands is that it facilitates the application of symbol-processing strategies for auditory scene exploration. Each strand can be thought of as an object which knows its start time, end time, and frequency and amplitude at each point along its length. What is required is a computational architecture which allows these objects to form groups based on organisational principles such as harmonicity and common amplitude modulation. Previous work in the group has adopted ad hoc architectures for discovering such groups, but our recent work is cast in the framework of a more powerful computational metaphor - the blackboard (e.g. I.D. Craig [12]).
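The strand-as-object view might be captured by a data structure along the following lines (the field and method names are illustrative rather than those of our implementation):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Strand:
    """A synchrony strand: a time-frequency track with per-frame attributes."""
    start_frame: int        # first frame (1 ms frames in the model)
    freq: np.ndarray        # dominant frequency at each frame (Hz)
    amp: np.ndarray         # amplitude (firing rate) at each frame

    @property
    def end_frame(self) -> int:
        return self.start_frame + len(self.freq) - 1

    def overlaps(self, other: "Strand") -> bool:
        """True if the two strands coexist for at least one frame."""
        return (self.start_frame <= other.end_frame and
                other.start_frame <= self.end_frame)
```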

A blackboard system can be thought of as a group of independent experts able to communicate only by manipulating entries on a globally accessible data structure (the blackboard). The experts’ actions are co-ordinated by a scheduler: each expert provides an agenda of actions it would like to perform based on the data on the blackboard, and the scheduler then chooses which operation will be carried out next. Our blackboard implementation operates in an object-oriented fashion.

We currently use two knowledge sources - HarmonicGrouper and AmplitudeModulationGrouper - whose aims are to find strands overlapping in time which are related by harmonicity and amplitude modulation, respectively. When a group is found (as a result of one execution cycle) a new Group object is created which contains a list of the members, and is added to the blackboard. The blackboard is connected to a display, so that the results, and progress, of this processing can be observed.
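In outline, the control structure looks something like the following sketch, which reuses the Strand class above. Only a HarmonicGrouper is shown, and its harmonicity test is reduced to a near-integer ratio of mean strand frequencies; an AmplitudeModulationGrouper would propose groups in the same way, but on the basis of correlated amplitude envelopes (as in the earlier sketch). Class names apart, the details are illustrative and far simpler than the real system.

```python
import numpy as np   # uses the Strand class sketched above

class Blackboard:
    """Globally accessible store holding the strands and the groups formed so far."""
    def __init__(self, strands):
        self.strands = list(strands)
        self.groups = []

class HarmonicGrouper:
    """Knowledge source: proposes grouping time-overlapping strands whose mean
    frequencies stand in a (near) integer ratio."""
    def propose(self, bb):
        done = {frozenset(map(id, g['members'])) for g in bb.groups}
        agenda = []
        for i, a in enumerate(bb.strands):
            for b in bb.strands[i + 1:]:
                if frozenset((id(a), id(b))) in done:
                    continue
                if a.overlaps(b) and self.related(a, b):
                    agenda.append((1.0, lambda a=a, b=b: bb.groups.append(
                        {'cue': 'harmonicity', 'members': [a, b]})))
        return agenda

    @staticmethod
    def related(a, b, tol=0.03):
        lo, hi = sorted([float(np.mean(a.freq)), float(np.mean(b.freq))])
        ratio = hi / lo
        return abs(ratio - round(ratio)) < tol * round(ratio)

def run(bb, knowledge_sources, max_cycles=20):
    """Scheduler: collect every expert's agenda and execute the best-rated action."""
    for _ in range(max_cycles):
        agenda = [item for ks in knowledge_sources for item in ks.propose(bb)]
        if not agenda:
            break
        rating, action = max(agenda, key=lambda item: item[0])
        action()

# Three strands starting together; two are harmonically related (220 and 440 Hz).
s1 = Strand(0, np.full(100, 220.0), np.ones(100))
s2 = Strand(0, np.full(100, 440.0), np.ones(100))
s3 = Strand(0, np.full(100, 317.0), np.ones(100))
bb = Blackboard([s1, s2, s3])
run(bb, [HarmonicGrouper()])
print(len(bb.groups))     # 1: the 220/440 Hz pair
```

In this sketch every proposal carries the same fixed rating; in a fuller system the rating would give the scheduler something to choose between when several experts compete.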

Figure 3 displays a typical screen layout for a blackboard session. The strands shown represent a mixture of speech with an artificial siren (the dominant wavy line). The tool allows the user to interact with the blackboard and to experiment with grouping strategies. One of the main motivations for casting auditory scene exploration in a blackboard architecture is the complexity of interactions between grouping principles which can occur in practice. The architecture can handle top-down control strategies, as suggested by the schema-driven grouping mentioned above, in addition to bottom-up primitive grouping.

Figure 4 shows the results of processing a mixture of a speech utterance (‘I’ll willingly marry Marilyn’) and a telephone ringing. Here the harmonic group discovered by the system is highlighted. The tool allows arbitrary subsets of strands to be resynthesised for aural evaluation. More quantitative performance evaluation is also possible. For instance, we can determine how much of some target source is recovered from the mixture, and thereby compare the utility of different grouping cues. Some results are reported in Cooke [10].
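One such measure - offered here purely as an illustration, not necessarily the measure used in [10] - is the fraction of the target source’s strand amplitude that ends up in the recovered group:

```python
import numpy as np   # uses the Strand objects sketched earlier

def recovery_rate(target_strands, recovered_group):
    """Fraction of the target source's total strand amplitude that ends up
    in the recovered group (1.0 means the whole target was recovered)."""
    recovered_ids = set(map(id, recovered_group))
    total = sum(float(np.sum(s.amp)) for s in target_strands)
    captured = sum(float(np.sum(s.amp)) for s in target_strands
                   if id(s) in recovered_ids)
    return captured / total if total else 0.0

# To compare cues, run the blackboard with one knowledge source at a time and
# report recovery_rate against the same hand-labelled target strands.
```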

Future work

The model described here has a number of limitations. Firstly, the auditory system uses at least 12 cues to solve the scene analysis problem. Thus far, we have only implemented a small number of these; hence, we would not expect the performance of the model to match that of a human listener. Other knowledge sources are currently being developed, for simultaneous and sequential grouping, and top-down processing.

Although the time-frequency nature of the synchrony strands representation allows the grouping of components that overlap in time, the model is currently unable to group components that are widely spaced in time, such as a series of voiced-unvoiced speech sounds or a series of notes from a musical instrument. Some preliminary work on this problem is reported in Brown and Cooke [4], where timbre differences have been used to group musical sounds over time. A Sequential Grouper (or a number of Sequential Groupers, each considering a different aspect of sounds) will determine whether groups separated in time have sufficiently similar characteristics that they could have arisen from the same source. If so, these groups will be joined to form a source description.

Figure 3 Screenshot of the BlackboardStrands application: the blackboard environment for interactive auditory scene exploration. The signal is a speech utterance (‘I’ll willingly marry Marilyn’) mixed with a siren. Strands with the same colour belong to the same group; individual strands can be inspected, and can also be moved in time and frequency (mistuned). The colour of individual strands and groups can be set by the user. Representations may be plotted on Hz, ERB-rate or Bark frequency scales, and may be superimposed over spectrograms or cochleagrams of the signal. Arbitrary selections of strands may be resynthesised, and the sound file saved if desired. The Groups Inspector ‘runs’ the blackboard, and allows the user to inspect groups and set thresholds for the various knowledge sources. The output of the blackboard’s execution cycle can be traced.

Figure 4 Grouped strands for a telephone+speech mixture. Highlighted strands represent a harmonic grouping found by the model.

As noted previously, however, one of the main motivations for the blackboard architecture is that it facilitates top-down processing. The blackboard is a flexible architecture, which allows primitive and schema-driven (learned) principles to influence the groups that are formed. Currently, however, only primitive grouping principles have been implemented. Learned principles could be added to the model by training neural nets to recognise particular timbres, on the basis of primitive information computed by early stages of processing in the model.

A further limitation is that our current system is ‘monaural’. Listeners tend to group sounds that are perceived as originating from the same location in space; this grouping is achieved by computing timing and intensity differences between the signals at the two ears. By making our system ‘binaural’ we would increase its capabilities considerably.
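The timing cue, for instance, amounts to estimating an interaural time difference, which can be obtained by cross-correlating the two ear signals. A toy sketch (with invented parameter values, and ignoring intensity differences) is given below.

```python
import numpy as np

fs = 44100
t = np.arange(0, 0.05, 1.0 / fs)
source = np.sin(2 * np.pi * 300 * t) * np.hanning(t.size)

# Simulate a source arriving 0.3 ms earlier at the left ear than at the right.
delay = int(0.3e-3 * fs)                         # about 13 samples
left = source
right = np.concatenate([np.zeros(delay), source[:-delay]])

# Cross-correlate over a plausible range of interaural lags (about +/- 1 ms).
max_lag = int(1e-3 * fs)
lags = np.arange(-max_lag, max_lag + 1)
corr = [np.dot(left, np.roll(right, -lag)) for lag in lags]
itd = lags[int(np.argmax(corr))] / fs
print(f"estimated ITD: {itd * 1e3:.2f} ms")      # close to the 0.3 ms imposed above
```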

Another challenge is to get the system running in real time. Currently, the time needed for processing is unacceptable for applications such as intelligent hearing aids and real-time automatic speech recognition. However, some hope is offered by the fact that the auditory system is highly parallel. If the model were implemented on a parallel computer, we could approach real-time performance.

References

[1] Cherry, E.C. Some experiments on the recognition of speech, with one and with two ears. J. Acoust. Soc. Am., 25, 975-979, 1953.

[2] Bregman, A.S. ‘Auditory Scene Analysis’, MIT Press, London, 1990.

[3] Cooke, M.P. An explicit time-frequency characterization of synchrony in an auditory model, Comp. Speech Language, 6, 153-173, 1992.

[4] Brown, G.J. and Cooke, M.P. Perceptual grouping of musical sounds: a computational model, Interface.

[5] Koffka, K. ‘Principles of Gestalt psychology’, Harcourt and Brace, New York, 1936.

[6] Pickles, J.O. ‘An Introduction to the Physiology of Hearing’, 2nd Edition, Academic Press, 1988.

[7] Brown, G.J. Computational auditory scene analysis: a representational approach, Ph.D. Thesis, University of Sheffield, 1992.

[8] Marr, D. ‘Vision’, W.H. Freeman, San Francisco, 1982.

[9] Patterson, R.D., Holdsworth, J., Nimmo-Smith, I. and Rice, P. SVOS final report: the auditory filterbank, U.K. MRC APU Report 2341, 1988.

[10] Cooke, M.P. ‘Modelling Auditory Processing and Organisation’, Cambridge University Press, 1993.

[11] Darwin, C.J. Perceptual grouping of speech components differing in fundamental frequency and onset time, Q. J. Exp. Psych., 33A, 185-207.

[12] Craig, I.D. ‘The CASSANDRA Architecture: Distributed Control in a Blackboard System’, Ellis Horwood Series in Expert Systems, 1989.
