A test of pre-emptiveness in speech perception

Albert S. Bregman, Sheila M. Williams

Bruce N. Walker, and Pierre Ahad

McGill University

Short title: Pre-emptiveness in speech perception

Contact author:

Albert S. Bregman

Dept. of Psychology

McGill University

1205 Doctor Penfield Avenue

Montreal, Quebec

Canada H3A 1B1

E-mail: [email protected]

Phone #1: +1 (514) 398-6103

Phone #2: +1 (514) 484-2592

Facsimile: +1 (514) 398-4896

Acknowledgements.

The authors acknowledge support from the Natural Sciences and Engineering Research

Council of Canada.

Author Note

BNW carried out Experiment 1 as an undergraduate thesis in 1993, and is now at the

School of Psychology and College of Computing, Georgia Institute of Technology,

Atlanta, Georgia. SMW carried out Experiments 2 and 3 as a postdoctoral fellow in 1997, and

now resides in Doncaster, UK. PA was technical supervisor, and is now at Digivox in

Montreal, Quebec, Canada. ASB did the overall design, supervised the research, and

wrote the present paper.

Abstract

The “pre-emptiveness” hypothesis claims that perceiving a cluster of acoustic

components (a complex) as a word is carried out by a specialized “speech module”,

which suppresses the perception of the components as independent sounds. This

hypothesis was explored using single words of sine-wave speech (SWS) as the

complexes. Separate groups of adults were trained to identify these stimuli as either

words (Speech group), the cartoon sound of an event in a virtual reality video game

(Event group), or as individual acoustic components to be counted (Analytic group).

Then their ability to detect one of the sine-wave components of the signal was evaluated

immediately after they identified the cluster according to their training (on the same trial).

The pre-emptiveness hypothesis implies that the Speech group (whose speech module

would have pre-empted some stimulus energy in perceiving the words) would be less

able than the Event group to hear out the sine-wave components. However, their actual

mean performances were almost identical. The Analytic group performed slightly (non-

significantly) better. A control experiment ruled out the possibility that the lack of a

deficit for the Speech group resulted from their not actually hearing the complexes as

speech. No support for pre-emptiveness was found.

Liberman and others have argued that there are at least two different auditory systems,

one that deals with the phonetic identities of speech sounds, and another that builds a

general-purpose representation of the locations and properties of the distinct non-speech

sounds present in the input (Liberman, 1982; Liberman, Isenberg, & Rakerd, 1981;

Liberman & Mattingly, 1985; Mattingly & Liberman, 1988; Whalen & Liberman, 1987,

1996). Their views are as follows: These two auditory systems are considered to be

separate brain modules. Liberman & Mattingly (1989) drew a distinction between

special purpose “closed modules” and general purpose “open modules”. They argued

that phonetic perception was provided by a closed module, specialising in the

representation of speech, whereas the process that computes the pitch or timbre of a

sound is an open module which is capable of representing any type of sound. Because

open and closed modules receive the same input, if there were no priority system,

listeners would always hear the outcome of both types of analysis. As well as hearing a

phonetic signal they would hear the separate hisses, buzzes and so on, upon which it was

based. To prevent this, the phonetic module is considered to have priority. It “pre-

empts” the information that it uses, taking what it needs, and making it unavailable to

non-speech analyses (Whalen & Liberman, 1987, 1996; also Mattingly & Liberman,

1988). This process is called the “pre-emptiveness” of the phonetic module. The portion

of the sensory information left over after phonetic analysis is passed to the non-speech

system(s). However, this happens only when the intensities of some components are in

excess of what is required for the speech percept. This makes it possible for speech and

concurrent non-speech sounds to be heard at the same time when they are truly present

together.

Whalen and Liberman (1987) illustrated this pre-emptiveness. They synthesised CV

syllables as clusters of time-varying formants, and succeeded in creating tokens of /ga/

and /da/ that were different only in the initial transition of the third formant. Then they

replaced this formant transition, in each syllable, with a sinusoidal tone glide, a “cartoon”

of the original formant transition. The intensity of this glide was varied, over a series of

stimuli, in both the /da/ and /ga/ syllables. At lower intensities, the correct syllable (/da/

or /ga/) was heard, demonstrating that the tonal glide was being used by the phonetic

module. However, above an intensity referred to as the “duplexity threshold”, both the

correct syllable and an additional non-speech chirp (derived from the tonal glide) were

heard. The researchers argued that at these higher intensities, the phonetic module had

more than the amount of energy that it needed from the glide to construct the phonetic

percept and the excess was passed on to the non-speech system, where it was interpreted

as a chirp. This was viewed as a direct demonstration of the pre-emptiveness of the

phonetic module. However, Bailey and Herrmann (1993) failed to find any range of

intensities of the tonal glide in which the syllable percepts were reliably distinguished but

the transition component could not be identified. They concluded that the findings of

Whalen and Liberman (1987) provided no real evidence that a “speech module” took

precedence over other auditory perceptual processes.

The purpose of the research reported here was to test the pre-emptiveness hypothesis by

comparing the ability of listeners to hear out the components of speech vs. non-speech

sounds. In comparing the processing of speech and non-speech signals, the choice of

stimuli is crucial. If these signals differ acoustically, this acoustic difference, rather than

the speech vs. non-speech status of the stimuli, may be responsible for the results.

Therefore the speech and non-speech signals must be identical. This goal can be

achieved by employing, as a stimulus, a cartoon of speech that the listeners may or may

not interpret as speech, depending on their biases. We chose what has been called “sine-

wave-analogue speech”, or simply “sine wave speech” (SWS) (Bailey, Dorman, &

Summerfield, 1977; Remez, Rubin, Pisoni, & Carrell, 1981; Remez, Rubin, Berns, Pardo,

& Lang, 1994). We manipulated the listeners’ bias as to whether they heard the signal as

speech or non-speech.

It is known that it is possible to bias listeners toward hearing the SWS signal as speech

by telling them that it is a form of speech. For example, Liebenthal, Binder, Piorkowski,

& Remez (2003) required participants to “hear out” one of the tonal components of a

SWS stimulus either when they were still naïve to the fact that it could be interpreted as

speech or after they were given instruction and practice in hearing it as speech. They

reported that this training interfered with the ability to hear out one of its tonal

components. This seemed to support the idea that the speech module is “pre-emptive”.

However, while the use of the same participants in the two conditions permitted more

powerful statistical tests to be used in this experiment, it confounded the bias of the

participants with the order of conditions, since the speech-biased condition always

followed the naïve condition. Furthermore, the interference was found in only the first

of two blocks of trials that the participants received after their training in hearing the

tonal complexes as speech. In the second block, the participants were able to perform as

well as they had before the speech-biasing training, and they performed as well or better

when the stimuli were SWS words as when they were tone-glide clusters that were

impossible to interpret as words (presumably the hypothesised phonetic module would be

evoked in the former stimuli but not the latter). So except for the noted difference, the

results expected by a pre-emptiveness theory were not found. It is possible – even likely

– that the results that Liebenthal et al. (2003) found in the first block after speech

training were not caused by an obligatory activity of a phonetic module in which it “used

up” some of the information, but were due to a distraction of the participants by their

attempts to take “peeks” at the phonetic identity of the stimulus while concurrently trying

to hear out an embedded component. In this manner, the training may have turned the

task into two concurrent ones for the participants, but by the second block of trials they

were able to overcome this distraction.

Also relevant to the pre-emptiveness hypothesis are the results of Experiment 2, Task 2,

of a paper by Remez, Pardo, Piorkowski, and Rubin (2001). In one of their conditions,

the participants first heard a tone that was (or was not) identical to the second formant

(F2) of a SWS word, and then heard a cluster of concurrent tonal glides (referred to as a

complex) that formed the full SWS word. They were asked (a) whether the tone was a

component of the complex and, at the same time, (b) to decide whether the complex

was the same word as a printed word. The participants were told that neither task was

considered primary and they could make the two responses in any order. This last

instruction would have permitted the participants to switch between one of Liberman’s

proposed modes and the other (phonetic versus general-purpose auditory) to make the

two judgements required by the task. Even if pre-emptiveness held true in natural

listening situations, in the highly structured repetitive task of an experiment, the

participants ― in order to satisfy the demands of the task ― might be able to first orient

themselves to the signal as a cluster of tonal glides, judge the presence of the target glide,

then switch modes for the more automatic task of perceiving it as a speech sound, using

the decaying echoic image of the sound to make the decision about the SWS word’s

identity. This strategy would allow them to escape the effects of pre-emptiveness (if they

existed). If this sort of switching between task orientations is possible, the experiment by

Remez, Pardo, Piorkowski, and Rubin (2001) cannot be taken as a decisive test of pre-

emptiveness.

Also, in their experiment, some participants started off by not hearing the complexes as

words. Later they were encouraged to hear them as words. The aspect of the results that

concerns us here is that the participants who heard the complex as a word could tell

whether the F2 tone was present in the complex slightly better than those who did not

hear it as a word, contrary to what would be predicted by the “pre-emptiveness” theory,

and also contrary to the conclusions of Liebenthal et al. (2003). However, the

researchers never tested statistically whether performance was different in these two

conditions. They only established that each was significantly different from zero. They

concluded that the results indicated the separate nature of “early phonetic and auditory

organization,” despite the fact that there was nothing in the results to indicate that they

were due to “early” organization.

Like the experiments of Remez et al. (2001) and Liebenthal et al. (2003), the present

experiments presented an SWS stimulus and asked participants to hear out one of its tonal

components. However, we took a different approach to the comparison of speech and

non-speech biases toward the stimulus. Following a method first used by Bregman and

Walker (1995), we divided the participants into three separate groups. All groups

received training in how to interpret the SWS stimuli, but each received a different type

of training. One group was trained to hear them as words (Word group); a second group

(Event group), as sounds made by events in a virtual-reality game, each of which

abstractly represented a particular type of real event (such as “spaceship door closing”).

This type of training caused participants to interpret the SWS tone-complex holistically,

i.e., as a unit, just as the Word group did, but a unit that was not verbal. A third group

was trained to hear the stimuli as clusters of tones by presenting the component tones

sequentially and requiring the listeners to count them (Analytic group). So in all cases,

the participants received training. In the Event and Analytic groups, we were able to

successfully train listeners not to hear the SWS signal as speech – despite repeated

presentations – by training them to hear it as something else.

In all three groups, participants were asked to make a judgement that tested their

awareness of the sinusoidal components of the stimuli. If the pre-emptiveness

hypothesis is correct, and if phoneme perception is a closed module, those who are biased

to hear the signal as speech should be less able to judge the properties of its individual

components.

The present study differed from that of Remez, Pardo, Piorkowski, and Rubin (2001) in

another important respect: we took some care to discourage the strategy of switching

between task orientations (or “modes”) during the trial (Experiments 1 and 2) and even to

make it virtually impossible (Experiment 3).

Outline of the three experiments. The stimuli were nine sine-wave words. In

Experiment 1, using a between-groups design, we presented these signals to listeners in

the Word group, the Event group, and the Analytic group. Our test for pre-emptiveness

required two tasks to be carried out on each trial: (1) to first identify the tone complex

according to the condition on which they had been trained (“identification task”), and (2)

to compare one of the sinusoidal components of the sound to a standard, and judge

whether it was the same or different (“component-matching task”). Apart from requiring

a fixed order of report, this task was similar to that employed by Remez et al. (2001). The

goal was to determine the effects, if any, on the component-matching task, of listening

either in (a) a holistic speech orientation, (b) a holistic non-speech orientation, or (c) an

analytic non-speech orientation.

Although it turned out that, in Experiment 1, there were no significant differences in the

ability of the three training groups to identify one of the sinusoidal components of the

tone complex, it might be argued that the training had been ineffective. Specifically, one

might argue that the speech versus non-speech training in the first experiment actually

produced the learning of simple rote associations between the sounds and the names

assigned to them, so that the Word training did not really evoke the hypothesised

phonetic module. In Experiment 2, to show that the training actually caused the

participants to interpret the properties of the signals as appropriate for a particular word

or a non-speech event, we trained them to learn either inappropriate associations between

sounds and labels (e.g., the label “shook” to the SWS derived from the word “coop”) or

appropriate ones (e.g., the label “shook” to the SWS derived from the same word) and

found inappropriate labels to be much harder to learn than appropriate ones. This

experiment also showed that the speech training yielded positive transfer to hearing new

signals as speech.

Our final experiment had two goals: (a) It tried to minimise mode-switching (if any)

within a single dual-task trial; (b) It also investigated pre-emptiveness by varying the

levels of intensity of the to-be-matched component, in order to find some level at which

the energy left behind, after the phonetic module took what it needed, would be

sufficiently low that there would be a detectable level of interference with the

component-matching task.

Experiment 1

Three conditions of training. In this experiment we used a dual task: identification of the

sound as a whole and identification of one of the sinusoidal components of the complex

tone. First we trained listeners to hear the ambiguous sine-wave speech sounds (which

we will call “complexes”) in three different ways: (1) The Word group was told that the

complexes were computer-synthesised words. They might sound strange, but, if the

participants listened closely, it would be possible to hear the words. (2) The Event group

was told that the complexes were really computer-generated versions of real-world

sounds that were to be used in a virtual-reality game. They might sound strange, and

indeed the participants might never have heard some of the sounds in reality, but if they

listened carefully, they would be able to hear the events that the sounds represented. (3)

The Analytic group was asked to count the components present in each sound.

The test for pre-emptiveness was a component-matching task; the participants decided

whether the second-lowest sinusoidal component of the complex (the “target

component”) was exactly the same as a comparison sound. When it was not, it was the

second lowest component from one of the other complexes in the set.

Derivation of the virtual-reality event labels. Most naive listeners just hear the

complexes as groups of sounds. However, after practice, the stimuli can be heard as

events such as water dripping into a sink, or a spaceship door opening. The non-speech

descriptions for the complexes were derived, in pilot testing, by presenting them to

listeners naive to SWS. They were asked to say what they heard, regardless of how

abstract it seemed. A number of these descriptions were then presented to other listeners

to select the best ones. In this way, descriptive labels for all of the complexes were

chosen (see Table 1). What was essential to our purpose was not realism (i.e. that the

participant hear the exact sound that the description portrayed), but rather that the

participant could be biased to hear the sound as some united whole other than a speech

sound. Later, it was clear from debriefing of the participants after the experiment that

this manipulation was successful.

(Table 1 about here)

Stimuli. The stimuli were a set of nine monosyllabic SWS words created by Robert

Remez and Philip Rubin who kindly supplied the parameters for them to be resynthesized

in our laboratory. These were the same sounds used by Remez et al. (2001).

Prediction. Assume that there is, indeed, a special speech module, such as that proposed

by Mattingly and Liberman (1988) that both (a) unites the component sounds of a speech

signal into a higher-order entity, and (b) pre-empts the signal and passes along, to the

general-purpose, auditory scene-analysis (ASA) system, only the acoustic information

that remains after speech-relevant information is removed. If this assumption is correct,

then when participants are biased to hear the sine-wave complex as words, their ASA

systems should be left with less information than the ASA systems of listeners who are

biased to hear them as non-words or as clusters of tones. Therefore the word-biased

listeners should have greater difficulty matching an embedded component to a standard

tone.

Method

Participants. Fifty-two paid participants (20 male, 32 female) with a mean age of 21.9

years and a range of 18 to 41 years were recruited. All read and signed a Consent Form,

authorized by McGill University ethics procedures, which indicated the nature of the

experiment; they were told they could quit the experiment at any time (this ethics

procedure was carried out for all experiments reported in this paper). No participant

reported any hearing impediment. All indicated English as their best language. Results

from two were excluded due to procedural errors.

Stimuli. In each of the tasks, the sounds that the participants heard were identical for all

participants. They consisted of nine tonal complexes, which were the SWS versions of

nine words: beak, sill, wed, pass, lark, rust, jaw, shook, and coop. Each consisted of

three or four gliding sinusoidal tones, each replicating the frequency trajectory of one of

the lower numbered formants of the corresponding word. There were two kinds of

stimulus sounds: (a) clusters of tonal glides (called “tone complexes”), each consisting of

the three or four components constituting a SWS word, and (b) a single sinusoidal

component (a varying tone), to be used as a standard, consisting of the second “formant”

of one of the SWS words.
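For readers who wish to see how stimuli of this kind can be generated, the sketch below (not the original synthesis software) builds one sine-wave “formant” as a sinusoid whose instantaneous frequency follows a formant track, and sums several such tones into a complex at the 20-kHz output rate used here. The formant tracks and amplitude values shown are illustrative placeholders, not the parameters supplied by Remez and Rubin.

```python
# Minimal sketch (not the original synthesis code): build one sine-wave
# "formant" as a sinusoid whose instantaneous frequency follows a formant
# track, then sum several such tones into an SWS complex.
import numpy as np

FS = 20_000  # output sampling rate used in the paper (Hz)

def gliding_tone(freq_track_hz, amp_track, fs=FS):
    """Sinusoid whose frequency and amplitude follow the given tracks."""
    phase = 2 * np.pi * np.cumsum(freq_track_hz) / fs
    return amp_track * np.sin(phase)

def sws_complex(formant_tracks, amp_tracks, fs=FS):
    """Sum three or four gliding tones into one sine-wave-speech complex."""
    return sum(gliding_tone(f, a, fs) for f, a in zip(formant_tracks, amp_tracks))

# Illustrative 300-ms example: a rising "F1" and a falling "F2";
# these values are hypothetical, not the Remez/Rubin parameters.
n = int(0.300 * FS)
f1 = np.linspace(300, 700, n)
f2 = np.linspace(1800, 1200, n)
amp = np.ones(n)
complex_signal = sws_complex([f1, f2], [amp, amp])
```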

There were also nine visual stimuli, used in Block 2 of trials (described below), simple

shapes made up from asterisk characters, and presented on a computer video display.

All stimuli are described in Table 1. Each column pertains to one of the SWS complexes.

Working from top to bottom, each column shows: (a) the identification number of the

tonal complex, (b) the visual shape linked to that complex in Block 2, (c) the

interpretation of the complex as a word, (d) its identity as a game event, (e) the number of

components in it. Words in adjacent columns contain vowels that are only one step away

from one another in a similarity measure defined by Remez, Pardo, and Rubin (1992).

The vowels in words that are two columns apart are less similar, and so on. Columns 1 to

9 are to be read as a circular sequence, with Columns 9 and 1 considered to be adjacent.

Procedure. The experiment was divided into five blocks, each consisting of a different

training procedure or test (details given below). For the first two blocks the three groups

were not yet differentiated by training, and the same tasks were given to all participants.

They included a component-matching test alone and then the same task concurrently

with a visual task (see details below). The concurrent visual and component-matching

task of Block 2 was designed to assess individual differences among participants in their

ability to carry out two recognition tasks concurrently. The scores from this task were

used as covariates to increase the power of the statistical tests so that they approached the

efficiency of a within-subjects design. Block 3 introduced the main independent

variable, a training procedure to bias the participant to interpret the complex in one of the

three ways described earlier. Block 4 verified that the training was successful, and

topped it up if necessary. Finally, Block 5 was the criterion task. It was designed to do

two things: (a) to induce the listening bias that the individual listeners had been trained

on, and (b) to concurrently test their ability to perceptually isolate the target component

of the tone-complex (the “component-matching task”).

Any test blocks that involved component-matching proceeded as follows: On each trial,

the participant saw a message and then heard two sounds, first a single component (the

Tone) then a full sine-wave speech sound (the complex). This was followed by the

question(s) to be answered. One question was whether the Tone was one of the

components of the complex.

The Tone was always the second-lowest component of one of the complexes. On half the

trials, where the correct answer was Yes, the Tone was from the complex presented right

after it. On the other half of the trials, where the correct answer was No, the Tone was

selected from a different complex, namely the one in the immediately adjacent column of

Table 1. For half the participants, this was the column one step to the right, and for the

other half, the column one step to the left (recall that the table is considered to be

circular).
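The foil-selection rule can be made concrete with the following sketch; it simply applies the circular offset described above, and the list ordering and function names are ours rather than part of the original experiment software.

```python
# Sketch of the No-Match foil selection implied by the circular Table 1
# ordering: the foil Tone is the second-lowest component of the complex in
# the adjacent column, wrapping around from column 9 to column 1.
COMPLEXES = ["beak", "sill", "wed", "pass", "lark", "rust", "jaw", "shook", "coop"]

def foil_complex(target_index: int, direction: int) -> str:
    """direction = +1 (one column to the right) or -1 (one column to the left)."""
    return COMPLEXES[(target_index + direction) % len(COMPLEXES)]

# Example: a No-Match trial on "coop" (index 8) for a participant assigned
# the rightward offset takes its foil Tone from "beak".
assert foil_complex(8, +1) == "beak"
```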

Prior to Block 3, all of the participants had heard the same stimuli, and had performed the

same tasks. They had not yet been biased to hear the complexes as speech, or as anything

other than a collection of strange sounds. The participants had been randomly assigned

to one of the three different bias conditions (Speech, Event or Analytic), which were

implemented in Blocks 3 and 4. These blocks were the training session and verification

of training that were used for setting the listening mode desired in the final block.

Procedural details.

Block 1. Tone-segregation task alone. The purpose of this task was to provide a baseline

for performance and to familiarise the participants with the task. The only response

requested was whether the Tone was present in the complex. This block of trials

consisted of 4 presentations of each of 18 conditions: 9 complexes and 2 relationships of

Tone to Complex (Match or No-Match), giving 72 trials in all, in sequences randomised

independently for each participant and each task block. The timing was as follows:

Visual warning (message), 2000 ms delay, play Tone, wait 1000 ms, play Complex, wait

500 ms, ask question and get response, wait 700 ms before next trial.
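As an illustration only, the Block 1 schedule and event sequence could be scripted roughly as in the sketch below; the structure (9 complexes by 2 match conditions by 4 repetitions, shuffled independently per participant, with the delays listed above) comes from the text, while the function and variable names are ours.

```python
# Sketch of the Block 1 schedule: 9 complexes x 2 (Match / No-Match) x 4
# repetitions = 72 trials, shuffled independently for each participant.
# Timing is expressed as (event, delay-after-event-in-ms) pairs; the
# stimulus-playback machinery itself is omitted.
import itertools, random

def block1_trials(seed=None):
    rng = random.Random(seed)
    trials = [
        {"complex": c, "match": m}
        for c, m in itertools.product(range(9), ["match", "no_match"])
    ] * 4
    rng.shuffle(trials)
    return trials  # 72 trials

BLOCK1_TIMELINE = [
    ("show_warning", 2000),
    ("play_tone", 1000),
    ("play_complex", 500),
    ("ask_question_and_collect_response", 700),  # then the next trial begins
]

assert len(block1_trials(seed=1)) == 72
```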

Block 2. Tone-Segregation concurrent with visual recognition. On each trial, one visual

stimulus, selected from the set of 9 shown in Table 1, was presented on the screen. Then

a Tone, followed by a Complex, was presented. Then, after 1500 ms, a second shape

selected from Table 1 appeared, and the participant had to answer two questions. The

first asked whether the second shape matched the first; the second asked whether the

Tone was present in the Complex. This second question was the same as in Block 1. The

timing was as follows: Display visual shape, wait 2000 ms, play Tone, wait 1000 ms,

play Complex, blank the display, wait 1500 ms, display the second shape, ask two

questions and collect the responses during a period of 3000 ms, wait 500 ms before next

trial. There were 108 trials in this task block.

Block 3. Practice in one of three biasing conditions (Speech, Event, or Analytic). For the

training block, a participant in the Speech condition, for example, would see the phrase,

“The word BEAK”, on the screen and then hear Complex 1 (the sine-wave sound derived

from the word “beak”) twice. Similarly, a participant in the Event condition would see

“The sound of a Volkswagen Horn” and then hear Complex 1 twice. Participants in the

Analytic condition would see the phrase, “A complex with 3 components”. Then they

would hear each of the components of the sine-wave complex played individually, in

ascending order, and then the whole complex twice. In this way, the whole set of 9

sounds was presented three times in different random orders for each participant.

The timing was: description of the complex on the screen (e.g., a word such as “beak”, or

a phrase such as “fireman sliding down a pole”), wait 2000 ms, play complex, wait 1500

ms, play complex again, wait 1000 ms. For the Analytic (component-counting)

condition, the timing was: describe, on the screen, the number of components, wait 1500

ms, play component 1, wait 500 ms, play component 2 ... (etc. for 3 or 4 components),

wait 1000 ms, play Complex, wait 1250 ms, play Complex again, wait 1500 ms before

next trial.

Block 4. Verification of effects of training. This block was included both to verify the

effects of the training and to top it up, if necessary. On each trial, the participants would

see one of the labels used in the training and then hear either the matching sound or the

one from the column preceding or following it in Table 1. The participants were then

asked whether the printed word or description matched the sound. Feedback was given

about the correctness of the answer. If the response was incorrect, feedback included the

correct information about the sound heard. An incorrect response inserted an additional

trial for that particular sound, in a random position in the sequence of trials. The session

continued until the participant had 3 trials in a row correct for each of the sounds.

Timing: Description (possibly false) of the Complex on the screen, wait 2000 ms, play

Complex, wait 500 ms, question, response and feedback, wait 2000 ms before next trial.

Block 5. Tone-Segregation concurrent with Identification of Complex. The final block

was the dual-task test of pre-emptiveness. This time, the participants saw one of the

labels used in the training, as in Block 4, and then heard a Tone followed by a Complex.

Then they were asked two questions: (a) The first (intended to set the listening mode)

asked whether the printed label matched the sound. (b) After collecting the response, this

was followed by a question concerning whether the Tone was present in the complex.

Timing: Description (possibly false) of Complex on screen, wait 2000 ms, play Tone,

wait 1000 ms, play Complex, wait 2000 ms, and two questions asked and responses

collected (2000 ms), wait 500 ms before next trial.

There were 108 trials in this block: The variables were (a) True and False pairings

between the description and each sound, (b) Match and No-Match conditions between the

Tone and the complex. Thus there were 9*2*2 conditions, each repeated 3 times.
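A quick enumeration (ours, not part of the original procedure) confirms that this factorial crossing yields the 108 trials:

```python
# 9 complexes x 2 label pairings (true/false) x 2 Tone conditions
# (match/no-match), each cell repeated 3 times, gives 108 trials.
from itertools import product

cells = list(product(range(9), ["true_label", "false_label"], ["match", "no_match"]))
trials = cells * 3
assert len(cells) == 36 and len(trials) == 108
```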

Post-experimental debriefing. During debriefing, the participants were all asked

specifically what they had heard. This was to ensure that none of the participants except

those in the Speech group had heard the complexes as words. This was planned as a

criterion for participant data rejection, but no data had to be excluded by this criterion. In

addition, the participants were asked if any of the sounds (words) were more difficult to

match to the descriptions given.

Apparatus. The sounds were digitally synthesised and presented via 16-bit D/A

converters. An output sampling rate of 20 kHz was used for all signals. The sounds were

presented diotically over headphones in a single-wall test chamber. Sound pressure

levels were measured using fast A-weighting and a flat-plate coupler. The individual

components of the complex ranged from 55 to 67 dBA, with the full complexes ranging

from 68 to 72 dBA. Visual material (messages and pictures) was presented via the

computer screen, and participant responses were registered by entering numbers via the

keyboard.

Results

Summary statistics for the five tasks, collapsed over the three bias groups, are given in

Table 2. Relative to the unaccompanied Tone-Matching task of Block 1, there was no

effect of the concurrent visual task of Block 2, but an apparently detrimental effect of the

concurrent auditory Complex-identification task of Block 5 (compare underlined

numbers in boldface). Note that chance performance is 50 percent.

(Table 2 about here)

MANOVA Tests

Non-concurrent vs. concurrent with visual. A within-subjects MANOVA showed that

the slight improvement in mean performance on the Tone-Matching task of Block 2

(concurrent with the visual task), relative to the same task performed alone in Block 1,

81% vs. 79%, was not significant, F (1,38) = 3.40, p = 0.069. If real, the improvement

was probably due to practice, partially cancelled, perhaps, by a small amount of

interference from the concurrent visual task. It is evident from Table 2 that the visual

task itself was extremely easy, performance reaching almost 100 per cent. It is not

surprising that the concurrent Tone-Matching task did not show any adverse effects.

Concurrent with visual vs. concurrent with complex recognition. There was a significant

drop in performance from the Tone-Matching Task of Block 2 (concurrent with the

Visual task) to the Tone-Matching task of Block 5 (concurrent with the task of

recognising the complex): 81% vs. 76%, F (1,38) = 22.4, p < .001. This is evidence that

the participants were unable to simply ignore the concurrent auditory complex-

recognition task when carrying out the Tone-matching.
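For illustration, comparisons of this within-subjects kind could be computed as in the sketch below; the data-frame layout and column names are hypothetical, and the original analyses were run in a different statistical package.

```python
# Sketch of the within-subjects comparisons reported above (e.g. Block 2
# vs. Block 5 tone-matching scores), using a repeated-measures ANOVA.
# The data frame and column names are hypothetical.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

def compare_blocks(scores: pd.DataFrame, block_a: str, block_b: str):
    """scores: one row per participant per block, with columns
    ['participant', 'block', 'pct_correct']."""
    subset = scores[scores["block"].isin([block_a, block_b])]
    return AnovaRM(subset, depvar="pct_correct",
                   subject="participant", within=["block"]).fit()

# Example: compare_blocks(df, "block2", "block5") would yield an F test
# analogous to the F(1,38) values reported in the text.
```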

Results from the tone-matching tasks. Percentage correct responses from the tone-

matching tasks and from the concurrent identification tasks of Blocks 2 (pre-bias) and 5

(post-bias) are shown in Table 3, classified according to the training bias given in Blocks

3 and 4.

(Table 3 about here)

Five separate between-subjects ANOVAs were carried out without covariance, one for

each task. There were no significant differences among the three bias-groups of

participants in any of these tasks. This means that the participants in the three groups

performed similarly on the Tone Matching task, both before and after biasing. Using

planned comparisons, there was, however, a significant difference, in the Complex-

recognition task of Block 5, between the participants in the Speech condition (84%) and

those in the Analytic condition (70%). [Note that this is not the criterion task (test for pre-

emptiveness) but the mode-setting task]. After training, it was easier to recognize the

complex as a word than to judge how many components it had, F (1,38) = 4.07, p < .05;

the task of judging it as a virtual-reality Event was intermediate in difficulty; its mean

percent correct was not significantly different from either of the other two groups.

The observed, but non-significant, difference in the final Tone-Matching task between

the Speech and Event groups (the test for pre-emptiveness) went in the direction opposite

to that predicted by the theory that speech recognition pre-empts auditory information.

According to that theory, the Speech group should have done worse than the Event group

but apparently did slightly better. However, we had to rule out the possibility that this

small advantage was due merely to random variation in Tone-matching ability between

the participants allocated to different groups.

Covariance analysis. To investigate this, and to get more precise estimates of the group

differences, if any, we employed an analysis of covariance (ANCOVA). Correlations

among all five tasks were computed, indicating that, as expected, all three Tone-

Matching tasks were highly correlated. The Tone-Matching task concurrent with the

Visual Identification task (Block 2) had the highest correlation to the Tone-Matching

concurrent with the Complex-Identification task of Block 5 (the pre-emptiveness test),

r = .63. Accordingly we used the Tone-Matching of Block 2 as the covariate for further

analysing the Tone-Matching of Block 5.
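The ANCOVA just described can be sketched as follows; the data-frame and column names are hypothetical, and the original computation used different software.

```python
# Sketch of the ANCOVA: Block 5 tone-matching scores analysed by Bias
# group with the Block 2 tone-matching score as covariate.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def ancova(df: pd.DataFrame):
    """df columns (hypothetical): 'block5' (criterion), 'block2' (covariate),
    'bias' (Speech / Event / Analytic)."""
    model = smf.ols("block5 ~ C(bias) + block2", data=df).fit()
    table = anova_lm(model, typ=2)   # F test for Bias, adjusted for the covariate
    adjusted = model.params          # inspect the adjusted group effects
    return table, adjusted
```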

The ANCOVA showed no significant effect of Bias condition, F (2,46) = 0.38, p = .68.

The values of the percent-correct means, adjusted by ANCOVA, are even closer together

than before adjustment (Speech = 75, Event = 75, Analytic = 77). The adjusted Speech

and Event group means differ only in the 4th significant digit. The standard errors of

these adjusted means were very small (about 1.7). If we estimate the confidence interval

for the difference between the means at the 5 per cent level, it is 4.7 percentage points. We also performed

a planned comparison between the Analytic group and the combined means of the two

“holistic-recognition” groups. There was no significant difference, F (1, 47) = 0.81,

p = .37.

Differences based on acoustic properties of the signals. While the biasing of participants

had a very weak or non-existent effect on the Tone-Matching task, the variation in

acoustic properties of the signals had a much larger one. Table 4 shows the mean rounded

percent correct scores for each complex (identified by its “word” label) for the three bias

groups in the three Tone-Matching tasks. The mean scores, shown in the last column,

range from 70 to 96 per cent, averaged across all three Tone-Matching tasks. This is

much larger than the range of 74 to 77 for the three bias groups (uncorrected by

covariance) in the final Tone-Matching task. Clearly some stimuli were much more

distinctive than others.

(Table 4 about here)

Discussion of Experiment 1

All three groups of participants performed significantly better on the Tone-Matching task

when it was concurrent with a simple Visual Matching task (Block 2) than when

concurrent with the identification of the complex (Block 5), even though the later task had

the advantage over the earlier in providing more practice. Clearly the identification of

the complex does use up some resources, but this does not appear to reflect a pre-emption

of acoustic information due to speech processing, as all the groups were affected

similarly. Both groups of participants who were encouraged to hear the complex as a

composite whole were affected to the same extent, although only one group (the Speech

group) heard the sounds as speech. The slight advantage in the criterion task for the

third group (Analytic bias) was not significant and was probably due to the fact that they

were the only group to be exposed to a decomposition of the complex (they heard each

component separately as well as hearing the complexes) during training.

Despite the significant detrimental effect of the concurrent task, it is of course possible

that we somehow failed to bias our participants, the interference being simply due to the

cognitive overhead of the concurrent task. It has been proposed by a critic that the

biasing task merely constituted a paired-associate learning task in which participants

learned an association between a sound and a label rather than actually perceiving the

sound as a word or a VR event. A second potential criticism involves the possibility

that, at the presentation intensities used in this experiment, the speech module needed a

very small proportion of the signal, leaving a more than adequate residue for auditory

processing in the Tone-Matching task. We address both these possibilities and others in

Experiments 2 and 3.

Experiment 2

Test for paired-associate learning. In Experiment 1, in order to equate the training

methods between the three bias groups as closely as possible, the same procedure had

been applied to all of them. This involved presenting a message on the computer screen,

together with a presentation of the sound. It is conceivable that this technique allowed

participants to learn the sounds, for all three conditions, using paired-associate learning.

If this were the case, then the lack of differences in component-matching might be

attributable to the application of the same (paired-associate) listening mode.

In Experiment 2, we tested this hypothesis in two ways: (a) by comparing the same

training with control training, which involved mapping the same words or descriptions to

the same set of sounds, but in an “inappropriate” way, breaking the meaningful link

between the sound and its label, and (b) by testing the transferability of the training onto

a new set of stimuli. Since paired-associate learning affects only the members of the

individual pairs that have been learned, and ones closely resembling them, any

performance on a new task involving different sounds would involve the ability to learn

new associations rather than to generalise the previously learned associations to new

sounds and labels without further training. So, under the hypothesis of paired-associate

learning, there should be no difference between the participants in the different “bias”

groups on their identification of new speech-like sounds on a later task, when no

reference was made by the experimenter to any connection with the earlier training.

For this experiment we replicated the training method of Experiment 1, extending the

minimum number of trials in the verification task to 54 in order to more accurately

evaluate the relative ease of learning between the different conditions.

Transferability. We introduced a new set of nine “vowel” sounds, each formed of three

steady-state frequency components, which required “holistic” interpretation before they

could be identified (only the composite three-tone sounds were unique; any individual

pure-tone component also occurred in at least one other of the sounds).

These sounds had component relationships similar to vowel formants although not

necessarily the same as in any of the SWS complexes used in the training.

Method

Participants. There were 114 paid participants, mostly young adults, recruited from a

university population. Data from 16 participants were rejected due to English not being

their first language or because of technical problems. As a result, 98 participants, 34

males and 64 females, provided data for analysis.

(Table 5 about here)

Stimuli. The sine-wave signals used for the training and the training-verification phases

of label learning were the ones used in the previous experiment (9 complexes and a set of

isolated second-formant tonal glides, the latter to be used as the standards in the

component-matching task). Nine new sounds were created to measure transferability of

learning (see Table 5). Each was the sine-wave analogue for a single vowel, and was

composed of 3 steady-state pure tones whose frequencies were based on the formant

frequencies for the steady states of pure vowels (Peterson & Barney, 1952). A set of

single-tone “formants” to be used in a tone-matching task with the “vowel” stimuli was

also synthesised. The actual frequency values for the formants of the “vowel” stimuli

were adjusted so that we could work with a limited set of frequencies, in order to ensure

that no component in any complex would be unique. The complexes fell within the range

62-68 dBA. The three-tone complexes were each generated with the same simple

amplitude envelope, having a duration of 200 ms, including 50 ms onset and offset

ramps, and were 64-65 dBA in intensity.
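A rough sketch of this synthesis is given below; the formant frequencies shown are illustrative values in the Peterson and Barney (1952) range rather than the adjusted frequency set actually used.

```python
# Sketch of the "vowel" complex synthesis described above: three steady
# sinusoids at formant-like frequencies, 200 ms long, with 50-ms linear
# onset and offset ramps. Frequencies below are illustrative only.
import numpy as np

FS = 20_000  # Hz

def ramped_envelope(n_samples, ramp_ms=50, fs=FS):
    ramp = int(fs * ramp_ms / 1000)
    env = np.ones(n_samples)
    env[:ramp] = np.linspace(0.0, 1.0, ramp)
    env[-ramp:] = np.linspace(1.0, 0.0, ramp)
    return env

def vowel_complex(formants_hz, dur_ms=200, fs=FS):
    n = int(fs * dur_ms / 1000)
    t = np.arange(n) / fs
    tone_sum = sum(np.sin(2 * np.pi * f * t) for f in formants_hz)
    return ramped_envelope(n) * tone_sum / len(formants_hz)

# e.g. an /i/-like complex (values roughly in the Peterson & Barney range)
signal = vowel_complex([270, 2290, 3010])
```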

Procedure. The procedure for training and testing each participant was controlled by a

computer and all instructions were presented to the participants via the computer screen.

The experimental session was divided into five blocks. Following some blocks, the

experimenter asked questions and the responses were recorded on the participant data

sheet. After training and verification of the initial learning of labels for the SWS complexes, all

participants then received another 3 tasks based on the vowel-sound set. These tasks

were the same for every participant, although the order of the two final tasks was

reversed for half of them. There was no pre-test.

The initial training and verification formed an inherent part of the experiment. No

practice examples or demonstrations of the sounds were given prior to the formal

training. Training varied according to condition (Speech, Event or Analytic) and

subcondition (the Appropriate vs. Inappropriate mapping between sounds and labels).

The experiment took around 50 minutes per participant, and was completed in a single

session.

Block 1. Practice in one of three (Speech, Event, or Analytic) biasing conditions. Each

participant underwent only one of 9 forms of training. The main conditions were the 3

different types of training (Speech/Event/Analytic) used in the previous experiment. The

subconditions for each type of training were 3 alternative mappings of labels to sounds.

The Appropriate subcondition used the mapping employed in Experiment 1. There were

two Inappropriate subconditions (Inappropriate-1 and Inappropriate-2), involving two

alternative mappings, which presented each sound together with the label of the sound

that either (a) preceded it by 2 columns in Table 1 or (b) followed it by 2 columns.

For training, a label (either a word, a description of an event, or a number of components

in the sound, according to condition) was displayed on the screen while a sound was

played (twice), as in Experiment 1. Trials on the nine sounds were repeated three times

in random order. In the component-counting training, the individual components were

played as well as the composite sound.

The complexes were those of Experiment 1 and were about 300 ms in duration. The

steps in the Speech and Event trials were: Display description of complex (word label or

description such as “fireman sliding down a pole”), play auditory complex (twice), all

separated by short silences. For the Analytic (Component-counting) trials: Display

sentence describing number of components, wait 1500 ms, play Component 1, wait 500

ms, play Component 2 ... etc. for 3 or 4 components, play the complex twice, all

separated by short silences.

Block 2. Verification of effects of training. During verification, each sound was presented

either with the label given during training or another label from the same set. As in

Experiment 1, we used two possibilities for the false choices, to reduce potential bias

from any one particularly distinctive alternative. For half the participants, the false

alternative labels were from the previous column of Table 1, and for the other

participants from the following column of the table.

Verification trials consisted of a message and a sound. The participant indicated whether

the displayed message matched the sound, on each trial. Feedback was given and extra

trials were added for conditions that gave rise to errors. The trials continued to a criterion

of at least 3 consecutive correct Yes responses and 3 consecutive correct No responses

for each sound, to a maximum of 80 trials (minimum 54 trials). Trial events were:

Display description (possibly false) of the complex, play complex, ask question, record

response and give feedback, all separated by short silences.
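The stopping rule can be summarised in the simplified sketch below (ours, not the original control software); it runs each sound-by-truth condition until three consecutive correct responses are obtained, inserts an extra trial after every error, and stops at 80 trials. With no errors, the initial queue of 9 x 2 x 3 trials gives the stated minimum of 54.

```python
# Simplified sketch of the verification stopping rule. The get_response
# callback (hypothetical) presents one trial and returns True if the
# participant answered correctly.
import random

def run_verification(sounds, get_response, max_trials=80, seed=None):
    rng = random.Random(seed)
    streak = {(s, truth): 0 for s in sounds for truth in (True, False)}
    queue = [key for key in streak for _ in range(3)]
    rng.shuffle(queue)
    trials_run = 0
    while queue and trials_run < max_trials:
        key = queue.pop(0)
        sound, truth = key
        correct = get_response(sound, truth)   # present trial, score the answer
        trials_run += 1
        if correct:
            streak[key] += 1
        else:
            streak[key] = 0
            queue.insert(rng.randrange(len(queue) + 1), key)  # extra trial after an error
        if streak[key] < 3 and key not in queue:
            queue.append(key)                  # keep testing until the criterion is met
    return trials_run
```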

Block 3. Tone matching within vowel complexes. For all participants, Block 3 was a

single Tone-matching task performed on the new set of three-tone “vowel” sounds. On

each trial, a warning was displayed, then the participant heard a single tone followed by a

three-tone complex and the only response requested was whether the Tone was present in

the complex. The single tone was either the centre tone from the three-tone complex or a

tone at the nearest frequency below it that had been used in a different complex.

The block of trials consisted of 3 presentations of each of 18 conditions (9 complexes and

2 relationships of Tone to complex – Match or No-Match). The individual components

(i.e. the target Tone or complex), regardless of individual length, were each in digital

sound files of 200 ms in duration. Trial events were: Display warning message, play tone,

play complex, question and response, all separated by short silences.

Blocks 4 & 5. Vowel-identity matching and Component-counting tasks. All participants

did both a Vowel-identity matching task and a Component-counting task. Each of these

was a single task, requiring only one response after hearing each sound. However, half

the participants did Vowel-identity matching before Component-counting and half

afterwards.

On each trial of the Vowel-identity matching task, participants heard a single three-tone

complex and were asked to select by number, from a list of nine words, the word whose

vowel sound most closely matched the sound heard. The words used, in order, were:

beet, bit, bet, bat, bob, bought, book, boot, and but (see Table 5). Trial events were:

Present warning message, play complex, question and response, all separated by short

silences.

On each trial of the Component-counting task, participants heard a single sound and were

asked to indicate, using a number from one to seven, how many components they heard.

Each sound was, in fact, either one of the three-tone complexes used in the tone-matching

of block 3 and the vowel-matching task or a single (centre) component from one of these

complexes. Trial events were: Display warning, play complex, question and response, all

separated by short silences.

Design. We used two different mappings of inappropriate interpretations to complexes

so as to reduce the potential bias arising if any one pairing was especially distinctive. We

tested only half as many participants in these two groups as in the appropriate-pairing

groups, as the former were to be combined in the statistical analysis. The variables were

3 Conditions (Speech, Event, and Analytic (component counting)), 3 subconditions

(appropriate pairing and two types of inappropriate pairing), 2 choices for false

alternatives (A or B) and 2 categories for Sex (M or F). Thus our design had 3*3*2*2 =

36 cells and involved 9 independent groups of participants.

All the sounds presented to each participant were the same as for any other participant

with only two exceptions. The first is that participants in condition 3, the Analytic

Training Condition, all heard the individual components of the sounds in ascending order,

preceding the presentation of each composite sound. This happened in the practice task

only. All other tasks, including the verification task, involved only the same complete

complexes as were presented to other participants. The second exception is that, within

the verification task (as in the previous experiment), participants continued until a

criterion was reached or to a maximum of 80 trials. This means that participants giving

incorrect responses had extra trials and therefore heard the sounds more times during this

task. However this was necessary in order to bring all participants to the same level or

performance. All other tasks had a predefined fixed number of trials.

Apparatus. This was the same as in Experiment 1 except that sound pressure levels were

measured at a fast B (rather than fast A) weighting. Levels are given under the Stimuli

heading.

Results

Mean percentage correct responses from each of the tasks, listed according to the training

bias given in Blocks 1-2, are given in Table 6. The conditions Speech-1, Virtual-Reality

Event-1, and Analytic-1 involve Appropriate pairings of labels with sounds.

(Table 6 about here)

A MANOVA was performed for the variables Bias (the training category) and

Appropriateness (appropriate vs. inappropriate mappings between labels and sounds) for

the four tasks shown in the columns of Table 6. The significance was evaluated using

Wilks’ Lambda (W-λ), and reported here in the form “W-λ(df,df) = <value>, p <value>”.
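An analysis of this form could be set up as in the sketch below; the data-frame and column names are hypothetical, and the original analysis was carried out in a different package.

```python
# Sketch of the two-way MANOVA (Bias x Appropriateness over the four task
# scores); Wilks' lambda appears in the mv_test() output.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

def run_manova(df: pd.DataFrame):
    """df columns (hypothetical): 'verify', 'tone_match', 'vowel_match',
    'count_comp' (task scores) plus 'bias' and 'appropriateness' (factors)."""
    model = MANOVA.from_formula(
        "verify + tone_match + vowel_match + count_comp ~ C(bias) * C(appropriateness)",
        data=df,
    )
    return model.mv_test()  # includes Wilks' lambda for each effect
```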

The effects of Bias were highly significant, W-λ(8,172) = 0.484, p < .0001, as were the

effects of Appropriateness, W-λ(8,172) = 0.469, p < .0001. There was also a significant

interaction between the two variables, W-λ(16,263) = 0.742, p = .0498. The univariate

tests for the dependent variables showed that this interaction was solely due to the Label

Verification Task where F(4,89) = 3.892, p = .006; for all other tasks, F (4,89) < 1.24, p

> .30.

Planned comparisons were performed, using a general MANOVA, on the data from all 4

tasks, to test the effects of appropriate versus inappropriate mapping of training sounds to

labels within each of the training conditions (speech / virtual reality event / component-

counting). For each task, this involved a single comparison of 1 appropriate against 2

inappropriate mappings, within each bias condition. The results of these tests are given

in Table 7.

(Table 7 about here)

Table 7 shows that the verification scores (for correct association of labels with sounds)

for participants given the Appropriate mappings were significantly better than for those

given Inappropriate mappings (the percent-correct means are shown in Table 6). This

held for each of the three training conditions, shown in different rows of the table

(p < .0001, in every case), leading to rejection of the hypothesis that training simply

produced paired-associate learning of labels to sounds.

For the vowel-matching task performed by the participants who had received Speech

training (Table 7, top row), the score for the group given the appropriate mappings was

also significantly better, at the 5% level, than that for participants given inappropriate

mappings, providing further evidence that participants with appropriate mappings had

learned to hear the complexes as words, and showing that this continued to have an

effect throughout the experiment, not just for the stimuli on which this difference in

training had been given.

The tone-matching task was given in order to further investigate the transferability of the

learning effects, and these results confirmed the distinction between the synthetic

(word/event) and analytic (component-counting) training that had been found in

Experiment 1.

(Table 8 about here)

Table 8 shows the results of planned comparisons testing the effects of the nature of the

training (Speech, Event, or Component Counting) on the various tasks, looking only at

the groups that had appropriate training (Groups 1, 4, and 7 of Table 6). For each task,

this involved three comparisons: Speech vs. Event; Speech vs. Component-counting; and

Event vs. Component-counting; therefore a Bonferroni adjustment was made on

probability values. As in Experiment 1, we see that the Speech and Event groups differed

in how well they performed in matching labels to the stimuli (Speech labelling was

easier). Furthermore they differed in how well they could match the vowel-derived

complexes to vowels (the Speech-trained group did better than the Event-trained group).

However, their earlier training had no effect on either matching a tonal component or

counting components in the vowel-derived complex, as would be expected if speech

recognition did not remove energy from ASA processes.

The vowel-matching task showed a significant advantage of the earlier Speech training

over the other two forms of training, demonstrating, again, that the learning acquired

during the biasing training of Block 1 (to hear the sounds as speech) transferred to the

vowel task. This, again, argues against a purely paired-associate interpretation of the

effects of our training procedure.

Referring back to Table 7, the tone-matching task showed an almost significant effect of

Appropriate vs. Inappropriate training in the Component-counting group (p = .068). A

complementary result is shown in Table 8, in which we look only at the conditions in

which Appropriate training was given (Speech-1, Event-1, and Analytic-1). We observe that the

Tone-matching task was performed better by participants trained in Component-counting

than those trained in either Speech or Event interpretations, though the latter comparison

was narrowly non-significant after Bonferroni correction. We also observe that in

Table 6, when we consider only Appropriate training (boldfaced values), the pattern of

Verification scores in the three training groups (percent correct scores of 96.99 for

Speech training, 86.93 for Event training, and 68.75 for Analytic training) replicated the

results of Experiment 1. These differences were all statistically significant (Table 8).

Words appear to be easier to attribute to the sounds than are virtual-reality events, and the

latter easier to attribute than the number of components.

Discussion of Experiment 2

We found that the natural mappings of the sounds to words or other descriptions

contributed both to ease of learning and to the transferability of that learning to later

tasks. Also, labels for the sounds were easier to learn when they were appropriate (the

actual words from which the sine-tone complexes had been derived, or the Event labels

spontaneously given in the pilot testing, or the actual number of tones in the complex)

than when inappropriate. We also found that within the speech group, only the subgroup

that had been asked to associate the appropriate words with their sine-wave formant

analogues during training demonstrated an advantage in recognising a new set of sine-

wave vowels later. In the case of component counting, participants who were given

appropriate labels (component counts) for the complexes learned them more easily than

participants given inappropriate labels, and this difference in training also may have

affected tone-matching in new materials (the vowels), a task that (like the training)

required the participants to “hear out” a tone in a complex (although the results were not

statistically significant). However, appropriate training in component counting did not

confer an advantage in judging the number of components in new materials.

We conclude that our training method induced different perceptual biases, facilitating

phonetic/phonemic perception for the Speech group only. The results also suggested

some difference between synthetic listening (holistic) and analytic listening, i.e., between

training on speech or virtual-reality events and training on component counting: as

before, only the synthetic vs. analytic distinction appeared to have an effect on the tone-

matching task.

The hypothesis that paired association was the principal mechanism for learning the

labels for the sounds was contradicted by the relative ease of learning appropriate

mappings vs. inappropriate ones, and the relative advantages they gave to the participants

in interpreting groups of steady components as vowels.

Experiment 3

Mode switching. In Experiment 1, the Speech group did not perform significantly worse

than the Event group on component-matching when this task was concurrent with the

identification of the complex. In Experiment 3, one of our goals was to rule out mode-

switching as an explanation. It is conceivable that the speech mode does indeed exist and

is pre-emptive of the signal, but that our subjects were able to escape its effects by mode-

switching. Even though a question about the identity of the whole had to be answered

before the question about the component, the participants might have carried out the

following strategy: first they performed the harder task of component-matching (in the

ASA mode), then switched to the Speech mode and used the decaying echoic memory of

the complex to perform the much easier task of matching a label to the complex as a

whole. Hence they were not yet in the speech mode when they did the component-

matching task.

The possibility of rapid mode-switching may seem incompatible with some of the

evidence for pre-emptiveness. In testing the pre-emptiveness hypothesis, Whalen and

Liberman (1987) concluded that it required more energy for a sine wave component –

substituted for a formant, in a synthetic syllable – to be heard as a separate entity than

was required to identify the syllable itself, and that this was due to interference from a

speech-specific mode or phonetic module. This happened even though no speech

interpretation was demanded concurrently from the participant. Had subjects been able

to switch modes at will, this interference would not have been obtained.

However, it might be argued by proponents of the pre-emptiveness hypothesis that the

mode-switching problem could have arisen in the present research and not in that of

Whalen and Liberman (1987). Our participants, because of their over-training on both

tasks, might have found it possible to perform one of them, and then rapidly switch

modes in order to perform the other. If some strategy permitted the component matching

task to be performed first, before entering the speech mode, the pre-emptiveness could be

avoided. This argument suggests that there might be pre-emptiveness whenever listeners

are actually in the speech mode, but that one can learn strategies for avoiding the speech

mode. We refer to this as the hypothesis of “momentary pre-emptiveness.”

Note that this criticism makes the pre-emptiveness of the speech mode a rather transitory

phenomenon. If one can switch in and out of the speech mode at will, the pre-

emptiveness of this mode would not preclude other analyses of the signal, except at those

instants in which one was actually in the mode. The hypothesis of momentary pre-

emptiveness has an implication for another claim: Liberman and Mattingly (1989) argued

that phonetic perception is a distinct module in auditory perception. However, if the

speech mode is not compulsory whenever a speech signal (or a signal that has recently

been interpreted as speech) is present, then it lacks one of the properties of a module as

specified by Fodor (1983), namely obligatory operation. So if momentary pre-

emptiveness exists, either phonetic perception is a module, but of a type different from

those specified by Fodor, or else phonetic perception is not a module at all. It was our

goal, in Experiment 3, regardless of its implications for theory, to prevent mode-

switching to the extent possible.

The duplexity threshold. Another explanation that a critic might offer for the lack of

difference between the Speech and Event groups in the component-matching task of

Experiment 1 is related to the intensity of the sounds. The stimuli may always have far

exceeded the minimum amplitude level for recognising the words, so that even if

phonetic recognition pre-empted some of the energy of the components that defined the

word, there was always plenty left for the component-matching task. To address this

criticism, in Experiment 3 we varied the amplitude level of the to-be-matched component relative to the other components of the complex.

Component-matching by a holistic strategy. It was possible that, in Experiment 1, an

isolated component (used as the “target” in the Component-Matching Task) bore some

similarity to the global properties of the natural word from which it was drawn and was

therefore identified as “in the complex”. The complex might not have to be analysed into

its components at all. If so, pre-emptiveness would not show itself. To address this

problem, we did two things: (1) we chose, as the "false" alternative, the temporal reversal

of the component contained in the complex with which it was being compared. This

ensured that both correct and false alternatives were in the same frequency range and had

the same frequency and amplitude changes (although not in the same temporal order).

(2) To eliminate the use of “naturalness” as a cue for rejecting the false alternative, we

also included, as complexes, versions of the original SWS complexes in which the

second components were reversed. These altered complexes did not sound as close to the

natural-speech words on which the SWS complexes were based as did the original SWS

complexes. However, their only function on any given trial was to elicit the (presumed)

speech mode, and they did not sound so distorted as to make them unacceptable as

versions of speech. For such complexes, the “false” choice of component (the one not in

the complex presented on that trial) was actually taken from the original unaltered SWS

complex. The inclusion of these complexes ensured that an accurate component match

had to be based on the component itself and not its similarity to the natural word from

which it was abstracted.
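To make the construction of the foils concrete, here is a minimal sketch (in Python with NumPy; this is not the authors' synthesis code, and it assumes each component waveform has been padded to a common length) of the temporal-reversal operation and of a complex in which only the second component has been reversed.

    # Minimal sketch, not the original synthesis code. Each component is assumed
    # to be a one-dimensional NumPy array of samples, padded to a common length.
    import numpy as np

    def reversed_foil(component):
        """Time-reverse a component: same frequency range and the same frequency
        and amplitude excursions, but in the opposite temporal order."""
        return component[::-1].copy()

    def complex_with_reversed_f2(components, f2_index=1):
        """Sum the component waveforms after reversing only the second ("F2") one."""
        parts = [c.copy() for c in components]
        parts[f2_index] = reversed_foil(parts[f2_index])
        return np.sum(parts, axis=0)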

In order to encourage participants to make their judgements about the identity of the

whole complex (by putting themselves in the speech mode) before they heard the isolated

component, they were told they would first see a label [a word, an event description or a

number of components, according to condition] and then would hear two sounds [the

complex followed by a proposed component]. They were told that they were first to

answer, as quickly as possible, whether the label was the right one for that complex, and

that this response would be timed. Then they would be asked whether the final sound

had appeared in the just-heard complex. We hoped to prevent the participant from

performing the component-matching task before entering the trained recognition mode.

Presenting the proposed component after the complex made the component-matching

task much harder, but most participants still managed to perform at above-chance levels.

Amplitude levels. For the reasons described earlier, we varied the amplitude level of the

to-be-matched component to see whether the lack of difference in component-matching

ability between Speech and Event participants would persist at lower amplitude levels. It

was not practical for us to present the second component of the complexes above and

below a specific duplexity threshold for each SWS word for each participant. With nine

different sounds, and 48 subjects, it would have taken a prohibitive number of repetitions

to find the presumed 432 duplexity thresholds. Furthermore we could not merely choose

amplitudes below the duplexity thresholds calculated from research on duplex perception

(e.g., Whalen and Liberman, 1987, 1996; Bailey and Herrmann, 1993) for three reasons:

(1) There is a disagreement in the threshold results of these two groups of experimenters;

(2) Our stimuli involved discrimination of the second-formant component whereas the

former experiments required discrimination of the third formant; (3) We were using full

SWS words, not the single syllables used in the cited experiments. Accordingly, we

compared the component-matching success of the Speech and Event groups at three

different amplitude levels, +2dB, -2dB and -6dB relative to the amplitude level of the

other components. It is not necessary to look for an all-or-nothing effect such that

sinusoidal components below the duplexity threshold are totally inaudible, and those

above it are totally audible. Pre-emptiveness, if it exists, should simply use up energy,

making it progressively harder to detect the properties of individual components, and this

interference should become more serious as the energy of the to-be-matched component

becomes lower.
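For reference, the relative levels translate into linear amplitude factors by the standard dB-to-amplitude conversion, as in the following arithmetic sketch (illustrative only, not code from the experiment):

    # Relative level in dB maps to a linear amplitude factor of 10 ** (dB / 20).
    for db in (+2.0, -2.0, -6.0):
        factor = 10 ** (db / 20.0)
        print(f"{db:+.0f} dB -> target amplitude x {factor:.3f} relative to the other components")
    # +2 dB -> x 1.259,  -2 dB -> x 0.794,  -6 dB -> x 0.501 (approximately)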

To improve the power of our analyses, we used, as a covariate, the level at which

participants could do component-matching while performing a concurrent identification

task that was visual, rather than auditory; the component was presented after the

complex, as in our main task. The sounds we used for the covariate task, and also for

training, were not the nine SWS complexes but were simplified complexes (see

description under the Stimuli heading).

Training. We repeated the Speech and Event training of the first and second experiments

on two new groups of participants. The Analytic group was omitted in this experiment.

The testing was administered in two sessions.

Method

Participants. We tested 48 new participants, young adults recruited from a university

population. None reported any hearing impairment and all reported that their first

language was English. Data from 8 participants were discarded because they did not meet our criteria or because of technical problems; these participants were replaced until 40 acceptable data sets had been obtained (14

males and 26 females).

Apparatus and Stimuli. The apparatus was the same as that used in Experiment 1.

The sine-wave-speech complexes used for training (Speech or Event) were the same as

those used in Experiments 1 & 2. A set of new sine-wave-speech complexes was

created, which had the second formant component in either the forward or reverse

direction and at four different amplitude levels with respect to the other components of

the same complex. There were also two sets of simplified complexes, used for the first

two tasks. In these, the highest and lowest tones (first and third “formants” of the SWS

complexes) were replaced by steady-state tones. The frequency of each was the average of the start and stop frequencies of the SWS component that it replaced. The

second component of each SWS complex was replaced by a simple glide from the start

frequency to the stop frequency of the “formant” that it replaced. These components

matched the durations of the components they replaced. The corresponding isolated F2

components in both forward and reverse directions were also created.
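As an illustration of this construction, the following sketch (Python/NumPy; the frequencies, duration and sample rate are hypothetical placeholders, not the values used in the experiment) generates one simplified complex from two steady tones and one linear glide, together with the isolated forward and reversed F2 components.

    # Minimal sketch, not the original synthesis code. Frequencies, duration and
    # sample rate below are hypothetical placeholders.
    import numpy as np

    def sine_component(freq_start, freq_end, duration, sr=22050, amp=0.3):
        """A sinusoid whose frequency moves linearly from freq_start to freq_end
        (a steady tone when the two frequencies are equal)."""
        n = int(duration * sr)
        inst_freq = np.linspace(freq_start, freq_end, n)   # instantaneous frequency
        phase = 2 * np.pi * np.cumsum(inst_freq) / sr      # integrate frequency to get phase
        return amp * np.sin(phase)

    duration = 0.5                                          # hypothetical word duration (s)
    f1 = sine_component(500, 500, duration)                 # steady tone at mean of F1 start/stop
    f3 = sine_component(2500, 2500, duration)               # steady tone at mean of F3 start/stop
    f2 = sine_component(1000, 1800, duration)               # linear glide replacing F2
    simplified_complex = f1 + f2 + f3
    isolated_f2_forward, isolated_f2_reversed = f2, f2[::-1]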

Procedure. The training and testing of each participant was controlled by computer.

After the last block in the second session, the experimenter asked several questions and

the responses were recorded on the participant’s data sheet.

Session #1 (Blocks 1-5)

Block 1: Training on component matching. Training (block of 36 trials) was given in

stages, beginning with component-matching on the simplified complexes described

above. If fewer than 25 correct matches occurred on the first attempt, the block was repeated a second time. Each trial consisted of a warning, the complex, then a single

component, then the question, all separated by silences.

Block 2: Pre-test. A dual task, combining visual identification with auditory component-matching, was given (72 trials). It was used both as a covariate for the statistical analysis and as a

pre-test, qualifying the participant to go further in the experiment. It also acted as

training for the later dual tasks. It consisted of the same visual pattern-matching task

used in Experiment 1, concurrent with the component-matching task already learned in

Block 1. The participant was encouraged to respond to the visual task without waiting

for the sounds to be completed, and was told that the visual task would be timed. Each

trial consisted of a warning, the complex, then a single component, then a question about

identity of the visual stimulus, then a question about the auditory component, all

separated by silences. Scores of less than 60 out of a maximum of 72 for visual-pattern

identification or less than 40 out of 72 for component-matching were employed to screen

out participants. Of the 8 who had to be discarded in this experiment, 4 were discarded

on this criterion.

Block 3: Identification Training. This block provided training for identification of the complexes (Speech or Event), together with verification of the participant’s performance on this identification. On the

training trials, a message (a printed word or a description of a virtual-reality event,

depending on the condition) was displayed on the screen while the corresponding

complex was played twice. There were 3 blocks of trials, each containing all 9

complexes in a random order.

Block 4: Verification and Feedback. Verification trials presented a complex and a

proposed label for it on the screen. The participant indicated whether the description

matched the sound. Feedback was given and extra trials were added for conditions

giving rise to errors. The trials continued until at least 2-in-a-row correct matches and 2-

in-a-row correct non-matches were obtained for each sound, for a minimum of 36 and a

maximum of 80 trials.
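On our reading of this stopping rule, the verification block can be summarised by a small loop like the sketch below (Python; the trial-selection and response functions are hypothetical placeholders standing in for the experiment-control program, not the authors' code).

    # Illustrative sketch of the Block 4 stopping rule; next_trial and get_response
    # are hypothetical placeholders for the experiment-control program.
    def verification_complete(streaks):
        return all(s["match"] >= 2 and s["nonmatch"] >= 2 for s in streaks.values())

    def run_verification(next_trial, get_response, n_sounds=9,
                         min_trials=36, max_trials=80):
        streaks = {s: {"match": 0, "nonmatch": 0} for s in range(n_sounds)}
        for trial in range(1, max_trials + 1):
            sound, kind = next_trial(streaks)        # kind is "match" or "nonmatch"
            correct = get_response(sound, kind)      # True if the answer was correct
            streaks[sound][kind] = streaks[sound][kind] + 1 if correct else 0
            if trial >= min_trials and verification_complete(streaks):
                break
        return streaks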

Block 5: Practice for Criterion Task (test for pre-emptiveness). One block of the dual

task, consisting of sound identification and component-matching, was given. To make it

easier, the to-be-matched component was always 6dB above the level of the other

components. Each trial began with a proposed label for the complex, a 1500-ms delay

before the message was cleared and a further 500-ms delay before the complex was

presented. This was followed by a 1000-ms silence and then the isolated component. After

a 100-ms wait the participants were asked to confirm the identity of the first sound (a

response which they believed to be timed). Immediately after the response, a second

question (whether the component had been in the complex) was displayed and the

participant responded.

Session #2 (Blocks 6-9)

Block 6: Verification and Feedback. At the start of the second session, the verification

task of Block 4 was repeated as a reminder of the complexes to be identified and to top

up the learning of any that had been forgotten.

Blocks 7-9: Criterion Task (test for pre-emptiveness). These blocks consisted of the dual

task in which both complex identification and component matching were performed on

each trial, with the “target” component at +2dB, -2dB or -6dB relative to the level of the

other components. Intensities were randomised within blocks. The timing was the same

as Block 5 in Session #1.

Results

The single component-matching and training verification tasks were used for practice

only and were not analysed. Other data are reported below.

(Table 9 about here)

The results for identification of the complexes as holistic words or events are shown in

Table 9. Mean scores (out of 8) are shown for each stimulus, amplitude and training

condition, for each of the dual tasks. “ID#” indicates the stimulus number as shown in

Table 1. The left-hand half of Table 9 gives the results for the Speech group. The column

labelled Visual (Cov.) gives the identification scores for the visual stimuli used in the

“covariate” task. The next 4 columns show the results for the holistic identification in the

training condition (where the second component was 6dB more intense than the others)

and in the test conditions, labelled according to the amplitude level of the target

component that occurred in the concurrent task. The next column shows the mean of the

previous 4 columns. The right-hand half of the table shows the same results for the Event

group.

Table 10 shows the component-matching scores (test of pre-emptiveness) for the same set

of conditions as in Table 9.

(Table 10 about here)

Statistical Analysis: MANOVA and ANCOVA

A preliminary view of the effects was given by a 5-way analysis using the general

MANOVA model: Training (Speech vs. Event), Sex (M vs. F), Tasks (Identification of

the whole vs. Component-matching), Complexes (1-9), and Attenuation levels (1-4). This analysis excluded the results for the visual stimuli.

By far the greatest effect was the one distinguishing the two concurrent Tasks.

Identification of the whole was much better than Component-matching, as can easily be seen by comparing the data in Tables 9 and 10. The grand mean for Blocks 7-9 was 7.54 for the Identification task but only 4.78 for

the Component-matching task (maximum possible score of 8.00 in both cases). These

were significantly different, F(1,36) = 894, p < .001. There was no significant effect of

Sex, F(1,36) = .001, p = 0.92; so this variable was dropped from subsequent analyses.

Other results from the 5-way analysis were shown equally clearly in subsequent 3-way

analyses; so no further use was made of the 5-way results. Since pre-emptiveness was

expected to influence the Component-matching task, but not the Identification of the

complex as a whole, further analyses were performed on the data for each task

independently.

A 3-way analysis using MANOVA, applied to the Identification responses (Table 9),

indicated significant contributions of training, F(1,38) = 9.25, p = 0.004, with the Speech

group finding it somewhat easier to remember the labels than the Event group did. Again, there was a very highly significant effect of Complexes, F(8,304) = 5.21, p << 0.001, some

complexes being easier to label than others. There was also a significant interaction

between training condition (Speech vs. Event) and Complexes, F(8,304) = 4.25, p <

0.001, reflecting the observation that the superiority of the Speech group to the Event

group was greater for some complexes than for others.

A 3-way analysis using MANOVA, applied to the Component-Matching responses

(Table 10), indicated a very highly significant effect for Complexes, F(8,304) = 29.9, p

<< 0.001, which reflected the fact that the complexes differed considerably in how easy it

was to hear out the target component. There was also a just-significant interaction

between Complexes and Attenuation, F(16,608) = 1.67, p = 0.047; attenuation made it

harder to hear out the component of some complexes more than others. The difference

between the Speech and Event groups, 4.83 versus 4.72 (the result that tested the pre-

emptiveness hypothesis) did not approach significance, F(1,38) = 0.88, p = .357.

Correlation between covariate and criterion tasks. In order to reduce the chance that our failure to find a difference between the Speech group and the Event group was due to unaccounted-for variance in the data, we used an analysis of covariance (ANCOVA) to

account for additional variance. Correlations were computed between the scores of the

component matching tasks of Block 2 (the covariate task: component-matching

concurrent with visual identification) and of Blocks 7-9 (the criterion task: component-

matching concurrent with Word or Event labelling). These were computed separately for

each level of the second component relative to the others: for +2dB, r =.58 (p < .001), for

-2dB, r =.61 (p < .001), and for -6dB, r =.51 (p ≅ .001). These levels of correlation were

deemed to be high enough to justify an analysis of covariance.
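The corresponding computation can be sketched as follows (Python, using pandas, SciPy and statsmodels; the data-frame and column names are hypothetical, and this is not the original analysis script): the covariate-criterion correlation is computed first, and the group comparison is then re-run with the Block 2 score entered as a covariate.

    # Minimal sketch, not the original analysis script. "df" is assumed to hold one
    # row per participant for a given level of the target component, with columns
    # "group" (Speech or Event), "block2_cov" and "criterion".
    import pandas as pd
    from scipy.stats import pearsonr
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    def correlation_and_ancova(df):
        r, p = pearsonr(df["block2_cov"], df["criterion"])
        print(f"covariate-criterion correlation: r = {r:.2f}, p = {p:.3f}")
        # ANCOVA: criterion score as a function of training group, adjusted for the covariate
        model = smf.ols("criterion ~ block2_cov + C(group)", data=df).fit()
        print(sm.stats.anova_lm(model, typ=2))
        return model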

The unadjusted Speech and Event means are shown in the first two columns of Table 11.

They are not significantly different. For the +2 dB component, F(1,38) = 0.54, p = .47;

for the -2dB component, F(1,38) = 0.32, p = .58; for the -6dB component, F(1,38) = 1.60,

p = .21.

(Table 11 about here)

The Speech and Event means, adjusted by covariance, are shown in the last two columns

of Table 11. The ANCOVA, like the MANOVA, showed no significant differences

between the Speech and Event conditions at any amplitude level of the target component:

For the +2dB components, F(1,37) = 1.34, p = .25, for the -2dB components, F(1,37) =

0.99, p = .33, and for the -6dB components, F(1,37) = 2.91, p = .09. Furthermore, the

unadjusted and covariance-adjusted means in Table 11 show that, contrary to the pre-

emptiveness hypothesis, which predicts that it would be easier to identify components in

the Event condition, it was actually slightly easier in the Speech condition (though not

significantly so).

Discussion of Experiment 3

We found no evidence to suggest that participants in the Speech group were

disadvantaged, relative to the Event group, in being able to identify components of the

sine-wave complex at any relative amplitude of the target component. While listeners

who heard the complexes as speech identified them better than those who heard them as

virtual-reality events, the performance of the two groups on the component-recognition

task was not significantly different. It is very likely that our range of relative amplitudes

bracketed the presumed duplexity threshold. The only relative amplitude at which the

Speech group did worse than the Event group was in the initial training with the target

component at +6dB (4.88 vs. 4.91 in last row of Table 10). This is surely well above any

presumed duplexity threshold. Thus there was no evidence to support the hypothesis that

listening to the sounds as speech invokes pre-emptive auditory processes any more than

listening to the sounds as other auditory events (both conditions in the current experiment

were synthetic, rather than analytic, listening tasks). The slight advantage for the Speech

group, if real, might have been due to the lower attention required for the identification

task, perhaps due to a lifetime of experience in phonetic discriminations or to memories

for the words in their natural form.

Liebenthal, Binder, Piorkowski, and Remez (2003) found that experience in attending to the

phonetic properties of the sinusoids interfered with their component-matching task (but

only in early trials) and was accompanied by a decreased auditory cortex activation for

SWS replicas of words but not for acoustically matched non-phonetic items. They

argued that this favoured a pre-emptiveness interpretation, but their finding is not

replicated in our Experiments 1 and 3. If anything, the listeners who interpreted the

sounds as speech did a tiny bit better in our study. The difference in the two studies may

be a question of the nature of the condition to which the “speech” condition was

compared. In the Liebenthal et al. study, the control condition (“naïve”) did not receive

any sort of training, while the “speech” condition (the same subjects, later in the

experiment) did. The training may have focussed the attention of the participants on

recognition, stealing attentional resources from mental representation of the components;

the authors themselves mention this possibility. In the present study, the main control

group, the Event group, received training on the recognition of the stimulus as a whole,

and was therefore better matched to the Speech group. In fact the non-speech

participants in the Liebenthal et al. study were more like our Analytic group in

Experiment 1 (component-counting), which did slightly, though not significantly, better

than our other two groups.

However, even if this difference between our Analytic and the Speech groups had been

significant, it would have been inappropriate to invoke the concept of a module to explain

the results. If the inferiority of the Speech group to the Component-counting group were

taken to represent the pre-emptive effects of a speech module, then the inferiority of the

Event group to the Component-counting group would also have to be taken as due to the

pre-emptive effects of a brain module – in this case, a hypothetical module that handles

the identification of all natural events except speech. It is much more plausible to believe

that the training in listening for individual components given to this group might simply

have given them a very slight advantage in isolating the individual tonal glides in the

complexes.

General discussion

Remez, Rubin, Berns, Pardo & Lang (1994) have argued that SWS could never be

integrated by presently-known principles of auditory scene analysis (ASA). The

components are not harmonically related and their frequencies and amplitudes do not

necessarily change in parallel. How then are they ever integrated? The only other

option, as they see it, is a speech module. However, we can identify three factors

contributing to integration: (1) Even in sine-wave speech, there are ASA cues that bind

the components together perceptually, namely synchrony of onset and offset of

components and co-variations in amplitude, when amplitude is allowed to vary (Barker &

Cooke, 1999). (2) Integration may be increased by top-down recognition. For example,

the native speakers of a click language integrate the clicks into the speech stream as

consonants, whereas a foreign listener hears the clicks as standing out from the speech

stream. (3) Integration may be the default and only after information builds up over time

does the auditory system partition the evidence into concurrent streams (Bregman 1990,

pp. 332-334). Therefore it is segregation, rather than integration, that requires

explanation.

This cannot be the whole story about SWS, because the components can still be heard out

by a listener even though they may be integrated for purposes of phonetic interpretation.

We believe that this dual-level perception will be found in audition whenever the bottom-

up evidence is not overwhelming for either integration or segregation (Bregman, 1991b).

In such cases the non-decisive groupings of components will permit the schema-driven

top-down recognition processes to yield a holistic interpretation. This happens in SWS,

where both a speech interpretation and lower-level components can be heard. It also

happens in visual displays: If a large drawing of a triangle is built out of tiny circles,

both the circles and the triangle are perceived. In visual art, faces in which the features

are actually different sorts of vegetables or other objects have been painted, for example

by the Milanese artist Giuseppe Arcimboldo in the late sixteenth century (Bompiani,

1987; Bregman, 1990, p. 471). We see the individual vegetables, and the whole face.

The present experiments provide evidence that when we exploit the ambiguous nature of

sine-wave-speech by inducing speech and non-speech biases for the same signals, these

biases have no effect on the hearing out of acoustic components. We have to consider,

then, the possibility that giving any sort of holistic interpretation to a complex suppresses

the perception of its acoustic components. This explanation is not implied by the results

of Experiment 1. While the Event group did hear out the components less well than the

Analytic group (74% vs. 77%), this difference was not significant.

In discussing Experiment 1, we presented the criticism that our dual tasks may not have

forced our listeners to perform the component-matching while still in speech-perception

mode. However, in our final experiment we adjusted the task to make it very difficult to

avoid being in the speech mode before judging an individual component. This was a much

stricter level of control than in the experiment by Whalen & Liberman (1987), where the

difficulty in hearing a weak sine-wave component continued to occur – due to a

hypothesized pre-emption of energy by the speech module – even when the phonetic and

auditory tasks were presented in separate blocks. Therefore, it seems unlikely that rapid

mode-switching under the control of the listener would block this pre-emption just in our

particular case, especially given the design of Experiment 3, where the holistic

interpretation was emphasised and the target component was not given until after the

whole complex.

Our conclusion is that hearing SWS complexes as speech does not pre-empt the

information about auditory characteristics in any way different from hearing the same

sounds as other environmental events (if at all). In this conclusion we agree with

Remez, Pardo, Piorkowski, and Rubin (2001) that the same signal can be interpreted as

speech or as clusters of components. However, we do not agree that these two kinds of

percepts necessarily imply the existence of separate “modules”. For example, we would

not want to say that when a person sees, at the same time, both an array of tiny circles

and the large triangle whose outline they form, this duality of perception (or what Alvin

Liberman called “triplex perception”) implies the existence of two distinct modules. Two

distinct (and perhaps parallel) recognition processes, yes; but not two separate

“modules”.

If the phonetic module, proposed by Liberman and his colleagues, really is a module in

the sense defined by Fodor (1983), its operation should be “mandatory” and “cognitively

impenetrable” (i.e., not able to be influenced by the cognitive processes, schemas, or

biases of the listener). Contrary to these requirements, not all listeners spontaneously hear

SWS as speech; so speech perception is not always mandatory. Furthermore, it is possible

to train listeners either to hear or not hear a signal as speech; so speech perception is not

cognitively impenetrable. We conclude that phonetic perception is not a module in

Fodor’s sense. The apparently obligatory perception of the normal speech waveform as

speech probably comes from the endless training provided by our experience from

earliest infancy. It is just as obligatory for literate people to recognize a canonical

version of a character from their writing system, or a picture of their country’s flag.

Other unpublished results from our laboratory argue against cognitive impenetrability of

speech perception. In a series of steps, we gradually morphed SWS into normal-

sounding speech. We did so by passing normal speech through a temporally varying set

of filters, each of which tracked one of the first four formants. With very narrow

bandwidths, the signal closely resembled SWS. With wider bandwidths it came to

resemble the original speech signal. If we created a series of steps with gradually

increasing bandwidth, listeners who heard the whole series, beginning with SWS,

required a higher bandwidth to correctly interpret the speech than listeners who started

later in the series of widening bandwidths. It appears that early incorrect hypotheses

interfered with the interpretation of the signal. This effect of hypotheses shows that

Fodor’s requirement that modules be cognitively impenetrable is not always found in

speech perception. Our guess is that impenetrability is found when the signal over-

determines the phonetic interpretation. This is an instance of a more general rule. When

any causal factor pushes any behavioural or mental response to its maximum value, a

second factor will appear to have no effect. To see the effect of a second cause (e.g., a

top-down factor), one has to weaken the effect of the first one (e.g., the bottom-up

influence on the interpretation). This can be done with sine-wave speech or other

degraded depictions of familiar objects, such as very simplified drawings.

References

Bailey, P.J., Dorman, M.F., Summerfield, A.Q. (1977). Identification of sine-wave

analogs of CV syllables in speech and non-speech modes. Journal of the Acoustical

Society of America, 61, S(A).

Bailey, P. J., & Herrmann, P. (1993). A re-examination of duplex perception evoked by intensity differences. Perception & Psychophysics, 54(1), 20-32.

Barker, J., & Cooke, M. (1999). Is the sine-wave cocktail party worth attending? Speech

Communication, 27, 159-174.

Bompiani (Publisher). (1987). Effetto Arcimboldo. Milan: Bompiani.

Bregman, A.S. (1990). Auditory scene analysis: the perceptual organization of sound.

Cambridge, Mass.: The MIT Press. (Paperback 1994).

Bregman, A.S. (1991). The compositional process in perception and cognition with

applications to speech perception. In I.G. Mattingly & M. Studdert-Kennedy (Eds.),

Modularity and the motor theory of speech perception. Hillsdale, N.J.: Erlbaum.

Bregman, A.S., and Walker, B.N. (1995). Does the "speech mode" pre-empt acoustic

components? Unpublished manuscript, Dept. of Psychology, McGill University.

Fodor, J.A. (1983). The modularity of mind. Cambridge, Mass.: The MIT Press.

Liberman, A. M. (1982). On finding that speech is special. American Psychologist, 37(2),

148-167. (Reprinted In: Handbook of Cognitive Neuroscience, Ed. by Michael S.

Gazzaniga. (1984) Plenum Press: New York, 169-197.)

Liberman, A. M., Isenberg, D., & Rakerd, B. (1981). Duplex perception of cues for stop

consonants. Perception & Psychophysics, 30, 133-143.

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception

revisited. Cognition, 21, 1-36.

Liberman, A.M., & Mattingly, I.G. (1989). A specialization for speech perception.

Science, 243, 489-494.

Liebenthal, E., Binder, J.R., Piorkowski, R.L., & Remez, R.E. (2003). Short-term

reorganization of auditory analysis induced by phonetic experience. Journal of

Cognitive Neuroscience, 15(4), 549-558.

Mattingly, I. G. & Liberman, A. M. (1988). Specialized perceiving systems for speech

and other biologically significant sounds. In G. M. Edelman, W. E. Gall, and W. M.

Cowan (Eds.). Functions of the Auditory System. (pp. 775-793). New York: Wiley.

Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24(2), 175-184.

Remez, R.E., Pardo, J.S., Piorkowski, R.L., & Rubin, P.E. (2001). On the bistability of

sine wave analogues of speech. Psychological Science, 12(1), 24-29.

Remez, R.E., Pardo, J.S., & Rubin, P.E. (1992). Making the auditory scene with speech.

Unpublished Manuscript, New York, Barnard College.

Remez, R. E., Rubin, P. E., Berns, S. M., Pardo, J. S., & Lang, J. M. (1994). On the perceptual organization of speech. Psychological Review, 101(1), 129-156.

Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception

without traditional speech cues. Science, 212, 947-950.

Whalen, D. H., & Liberman, A. M. (1987). Speech perception takes precedence over

nonspeech perception. Science, 237, 169-171.

Whalen, D. H., & Liberman, A. M. (1996). Limits on phonetic integration in duplex

perception. Perception & Psychophysics, 58(6), 857-870.

Tables

Table 1. Sequential ordering, visual shapes, words, virtual-reality events and number of components associated with each tonal complex. (The visual stimuli were nine distinct patterns of asterisks, one per complex; they are not reproduced in this text version.)

ID of complex   SWS word   Virtual-reality event         Number of tonal components
1               beak       Volkswagen horn               3
2               sill       slide                         4
3               wed        spaceship door opening        3
4               pass       baby seal                     4
5               lark       fireman sliding down pole     3
6               rust       sound system's feedback       4
7               jaw        squeegee wiping glass         3
8               shook      dripping faucet               3
9               coop       stone falling into water      3

Table 2. Overall mean performance (percent correct) on the five tasks. Block numbers are shown in parentheses. Tasks in the same block are concurrent.

Task (Block number)               N    Min.  Max.  Mean  Std. Err.  Std. Dev.
Tone-matching Task (1)            50   61    89    80    0.9        6.6
Visual Task (2)                   50   88    100   98    0.4        2.7
Tone-matching Task (2)            50   56    92    81    1.1        7.4
Complex-Identification Task (5)   50   43    99    76    2.5        18.0
Tone-matching Task (5)            50   51    89    76    1.3        9.1

Table 3. Results. Mean percent correct performance on five tasks (unadjusted by covariance). The block numbers are in parentheses. Tasks within the same block are concurrent. Note that chance performance is 50 percent.

Bias Group        N    Tone          Visual         Tone          Complex              Tone
                       Matching (1)  Matching (2)   Matching (2)  Identification (5)   Matching (5)
Speech            18   77            98             81            84                   75
Virtual Reality   17   80            98             80            73                   74
Analytic          15   79            99             81            70                   77

Table 4. Percent correct scores (rounded) for the nine signals on each of the Tone Matching tasks for the three groups [W = Word; E = virtual reality Event; A = Analytic (component counting)].

              Tone task alone     with Visual task        with Complex Id. task
                                  (prior to biasing)      (after biasing)
Word          W    E    A         W    E    A             W    E    A          Mean
coop          69   71   67        75   74   72            69   72   63         70
shook         66   68   68        68   71   68            71   73   76         70
rust          61   75   73        77   67   74            62   66   67         69
jaw           77   83   74        84   76   77            73   68   66         75
lark          76   78   76        81   82   77            68   69   76         76
pass          80   79   82        78   82   83            71   71   86         79
sill          75   81   86        81   84   92            86   75   88         83
wed           94   89   86        88   85   88            87   82   78         86
beak          94   96   100       98   97   100           93   90   95         96
Mean          77   80   79        81   80   81            75   74   77

Table 5. The 9 additional sine wave complexes: vowels and the frequency values (in Hz) of their 1st, 2nd and 3rd “formants” (sinusoidal partials) and the partial used as an “incorrect” alternative to the 2nd partial in the Tone-matching task.

Order   Vowel in...   1st partial   2nd partial   3rd partial   Alt. 2nd partial
1       “Beet”        319           2731          3011          2617
2       “Bit”         398           2027          2617          1147
3       “Bet”         599           2027          2731          1147
4       “Bat”         771           2027          3011          1147
5       “Bob”         771           1147          2617          929
6       “Bought”      599           929           2617          771
7       “Book”        398           929           2027          771
8       “Boot”        319           929           2371          771
9       “But”         599           1147          2371          929

Table 6. Percent correct performance on four tasks. The block numbers are in parentheses. Note that chance performance is 1:2 (50 percent) for blocks 2 and 3, 1:9 (11.11 percent) for Vowel identity matching and 1:7 (14.28 percent) for Component-counting. Groups 1, 4 and 7 are the groups given Appropriate mappings.

Group #  Group Name   N    Label             Matching components  Vowel identity      Component-
                           Verification (2)  in vowels (3)        matching (4 or 5)   counting (4 or 5)
1        Speech-1     16   96.99             64.12                34.72               61.23
2        Speech-2     8    69.21             67.13                29.63               66.67
3        Speech-3     8    71.76             61.57                25.23               65.97
4        Event-1      17   86.93             66.01                22.88               66.88
5        Event-2      8    75.23             68.75                24.07               64.35
6        Event-3      8    69.44             59.95                24.31               60.42
7        Analytic-1   16   68.75             74.88                26.62               64.47
8        Analytic-2   8    62.96             71.76                27.31               66.90
9        Analytic-3   9    51.03             61.32                25.72               61.32

Table 7. Results of planned comparisons of appropriateness of training.

Comparison                        Complex Verification   Tone matching     Vowel matching    Component counting
Speech group:                     F(1,89) = 62.3         F(1,89) = .002    F(1,89) = 5.40    F(1,89) = 1.28
 Appropriate vs. Inappropriate    p < 0.0001*            p = 0.914         p = 0.021*        p = 0.260
Event group:                      F(1,89) = 19.4         F(1,89) = 0.13    F(1,89) = 0.18    F(1,89) = 1.03
 Appropriate vs. Inappropriate    p < 0.0001*            p = 0.716         p = 0.675         p = 0.313
Component-Counting group:         F(1,89) = 12.6         F(1,89) = 3.33    F(1,89) = .001    F(1,89) = .007
 Appropriate vs. Inappropriate    p < 0.0001*            p = 0.068         p = 0.924         p = 0.895

Note: * indicates significance at the 5% level or better after Bonferroni adjustment for multiple comparisons.

Table 8. F and p values for differences among the three groups (Speech / Virtual-Reality Event / Component Counting) on four tasks, for the groups that had Appropriate training. Degrees of freedom are 1 and 89 for all tests.

Comparison                        Verification    Tone-matching   Vowel-matching   Component counting
Speech vs. Event                  F = 9.25        F = 0.17        F = 14.7         F = 1.63
                                  p = 0.003*      p = 0.682       p < 0.001*       p = 0.202
Speech vs. Component Counting     F = 70.7        F = 5.38        F = 6.66         F = 0.52
                                  p < 0.0001*     p = 0.021       p = 0.011*       p = 0.480
Event vs. Component Counting      F = 30.2        F = 3.77        F = 1.47         F = 0.30
                                  p < 0.0001*     p = 0.052       p = 0.227        p = 0.594

Note: * indicates significance at the 5% level, after Bonferroni adjustment for multiple comparisons.

Table 9. Identification Scores (out of 8), Experiment 3. A0 (+6dB) is the level of the target component on the training trials; A1-A3 are the test levels. "Visual (Cov.)" gives the identification scores for the visual stimuli used in the covariate task.

                       Speech Group                                       Event Group
ID #   Visual   A0     A1     A2     A3     Mean       Visual   A0     A1     A2     A3     Mean       Mean audio,
       (Cov.)   +6dB   +2dB   -2dB   -6dB   audio      (Cov.)   +6dB   +2dB   -2dB   -6dB   audio      both groups
1      7.95     8.00   7.80   7.90   7.85   7.89       7.85     7.75   7.90   7.85   8.00   7.88       7.88
2      7.80     7.25   7.45   7.60   7.55   7.46       7.75     6.65   7.10   7.30   7.20   7.06       7.26
3      7.85     8.00   7.90   7.75   7.90   7.89       7.85     7.60   7.70   7.70   7.75   7.69       7.79
4      7.90     7.75   7.90   7.85   7.75   7.81       7.80     7.45   7.45   7.30   7.30   7.38       7.59
5      7.75     7.95   7.95   7.90   7.95   7.94       7.80     6.80   7.05   7.50   7.35   7.18       7.56
6      7.90     7.70   7.95   7.90   7.85   7.85       7.80     6.75   7.40   7.30   7.50   7.24       7.54
7      7.90     7.90   8.00   7.95   7.95   7.95       7.80     7.60   7.70   7.80   7.70   7.70       7.83
8      7.90     7.20   7.90   7.70   7.70   7.63       7.85     5.80   6.35   6.30   6.25   6.18       6.90
9      7.90     7.50   7.35   7.45   7.50   7.45       7.85     7.55   7.80   7.35   7.60   7.58       7.51
Mean   7.87     7.69   7.80   7.78   7.78   7.76       7.82     7.11   7.38   7.38   7.41   7.32       7.54

Table 10. Component-matching, Experiment 3. Mean scores (out of 8) by sound, amplitude and training condition, for each of the dual tasks. (The +6dB condition occurred only on the training trials.)

                       Speech Group                                       Event Group
ID #   Visual   A0     A1     A2     A3     Mean       Visual   A0     A1     A2     A3     Mean       Mean audio,
       (Cov.)   +6dB   +2dB   -2dB   -6dB   audio      (Cov.)   +6dB   +2dB   -2dB   -6dB   audio      both groups
1      5.60     5.05   4.75   4.55   4.55   4.73       5.85     5.35   4.75   4.50   4.35   4.74       4.73
2      7.35     6.45   5.90   6.25   5.55   6.04       7.20     6.30   6.10   6.35   6.05   6.20       6.12
3      6.40     4.25   4.30   4.00   4.60   4.29       7.15     4.15   4.15   4.30   3.95   4.14       4.21
4      6.15     3.95   3.95   4.15   4.05   4.03       6.00     4.20   3.95   4.00   3.80   3.99       4.01
5      6.45     5.10   5.80   4.80   5.55   5.31       6.55     5.45   5.60   4.95   5.20   5.30       5.31
6      6.80     5.55   4.80   5.00   5.30   5.16       7.00     4.80   4.65   4.25   4.45   4.54       4.85
7      6.05     3.90   4.00   4.05   3.95   3.98       6.55     4.00   4.05   4.15   4.00   4.05       4.01
8      6.40     4.50   4.65   4.55   4.40   4.53       6.40     4.40   4.25   4.15   3.90   4.18       4.35
9      6.20     5.15   5.65   5.65   5.30   5.44       5.85     5.55   5.15   5.35   5.50   5.39       5.41
Mean   6.38     4.88   4.87   4.78   4.81   4.83       6.51     4.91   4.74   4.67   4.58   4.72       4.78

Table 11. Mean component-matching scores for the Speech and Event groups, summed over the nine complexes (maximum possible score = 72), unadjusted and adjusted by covariance.

Level of F2      Unadjusted Mean,   Unadjusted Mean,   Covariance-Adjusted    Covariance-Adjusted
component        Speech group       Event group        Mean, Speech group     Mean, Event group
+2dB             43.8               42.7               44.0                   42.5
-2dB             43.0               42.0               43.2                   41.8
-6dB             43.3               41.2               43.4                   41.0
