
Haskins Laboratories Status Report on Speech Research 1991, SR-107/108, 63-80

Listening with Eye and Hand: Cross-modal Contributions to Speech Perception*

Carol A. Fowler and Dawn J. Dekle

Three experiments investigated the basis for the "McGurk effect," whereby optically specified syllables experienced synchronously with acoustically specified syllables integrate in perception to determine a listener's auditory perceptual experience. One hypothesis is that the effect arises when optical and acoustic cues for a syllable are associated in memory. A second hypothesis is that the effect arises when cross-modal information, familiar or not, is convincingly about the same articulatory speech event in the environment. Experiments contrasted the cross-modal effect of orthographic syllables on acoustic syllables, presumed to be associated in experience and memory, with that of haptically experienced and acoustic syllables, presumed not to be associated. Findings were that the latter pairing, but not the former, gave rise to cross-modal influences under conditions in which subjects were informed that cross-modal syllables were paired independently. Mouthed syllables felt by a perceiver affected reports of simultaneously heard syllables (and reports of the mouthed syllables were affected by the heard syllable). These effects were absent when syllables were simultaneously seen (spelled) and heard. We conclude that the McGurk effect does not arise from association in memory, but rather from conjoint near specification of the same causal source in the environment: in speech, the moving vocal tract producing phonetic gestures.

In a variety of circumstances, including, presumably, face-to-face spoken communications outside the laboratory, listeners can acquire phonetic information optically as well as acoustically. Seeing the face of a speaker considerably improves listeners' abilities to recover speech produced in noise (e.g., Erber, 1969; Ewertsen & Nielsen, 1971; Sumby & Pollack, 1954), and, when a visible talker and his or her apparent acoustic output are mismatched in an experiment, as in the so-called "McGurk effect" (e.g., McGurk & MacDonald, 1976; MacDonald & McGurk, 1978), phonetic information recovered optically may override that recovered acoustically. This is particularly likely to occur when the phonetic information is for consonantal places of articulation close to the front of the speaker's mouth. Accordingly, optical "tap" paired with acoustic "map" may be reported as "nap," with voicing and nasality consistent with the acoustic signal and place of articulation consistent with the optical display (MacDonald & McGurk, 1978; Summerfield, 1987).

*The research reported here was supported by NICHD Grant HD-01994 to Haskins Laboratories. We thank George L. Wolford for his participation in the pilot research, his guidance on some of the statistical analyses and for his comments on an earlier draft of the manuscript; we also thank Lawrence Rosenblum for comments on the manuscript and Michael Turvey for comments on our General Discussion.

Remarkably, the cross-modal influence, phenomenally, is not due to hearing one utterance, seeing another and reporting some compromise. Rather, the visible utterance generally changes what the listener experiences hearing (Liberman, 1982; Summerfield, 1987). Accordingly, the visual influence remains when subjects are instructed to report specifically what they heard rather than what the speaker said (e.g., Summerfield & McGrath, 1984), and it remains even after considerable practice attending selectively (Massaro, 1987).

Two general accounts of the McGurk effect may be derived from current theories of speech perception. One is that perceivers consult memory representations of fundamental units of spoken utterances (prototypes of syllables in Massaro's theory; e.g., 1987, 1989) that include specifications both of optical and acoustic "cues" for the utterance. Presumably, the memory representations derive from experience with token productions of the utterance in which various subsets of the optical and acoustic cues were detected by the perceiver. When partially conflicting optical and acoustic cues are detected in a McGurk procedure, the listener selects, and experiences hearing, the memory representation most consistent with the collection of cues. In this kind of account, the influence of the visual display on the percept derives from the association of optical and acoustic cues in memory, and these associations, in turn, derive from the association of the cues in the world as sampled by the perceptual systems.

A different general account of the phenomenon derives from theories of speech perception claiming that listeners to speech do not hear the acoustic signal per se, but rather hear the phonetically significant gestures of the vocal tract that give rise to the acoustic speech signal. Two such theories are the motor theory of speech perception (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman & Mattingly, 1985) and the direct-realist theory (Fowler, 1986; Rosenblum, 1987).

The motor theory was developed to explain findings suggesting that, at least in some circumstances, there is a closer correspondence between the listener's percept and the vocal-tract gestures of the talker than between the percept and the acoustic signal. (See Liberman et al., 1967, for some examples.) According to the motor theorists, a speech mode of perception in which articulation is recovered from acoustics evolved to handle the special problems associated with recovering phonetic information encoded in the acoustic signal. The encoding occurs because consecutive consonants and vowels in an utterance are coarticulated, that is, produced in overlapping time frames. In consequence of coarticulation, there are no boundaries between phonetic segments in the signal, and the acoustic information for a consonant or vowel is highly context-sensitive. According to the theory, listeners use the acoustic speech signal to devise a hypothesis concerning the set of articulatory gestures that, when coarticulated, would have given rise to that signal. Testing the hypothesis involves the central component of the listener's speech motor system (an "innate vocal tract synthesizer" according to Liberman & Mattingly, 1985), and that, in the theory, is how the percept acquires a motor character.

The direct-realist theory accepts the motor theorists' evidence that listeners to speech recover phonetically significant gestures of the vocal tract, but it explains the recovery differently. In the theory (derived from Gibson's more general theory of direct perception; e.g., 1966, 1979), perception is the only means by which organisms can know the environment in which they participate as actors. It is crucial, therefore, that perceptual systems generally acquaint perceivers with relevant aspects of the environment. Perceptual systems do so by extracting information from media, such as light, skin and air, about properties of the environment. These media can provide information about the environment because properties of the environment cause distinctive patternings in them; the distinctive patterns, in turn, can serve as information for their causal source. In vision, although retinas are stimulated by reflected light, perceivers do not see patterns in light; rather, they see the environmental sources of those patterns. In haptics, perceivers do not infer palpated object properties from felt skin deformations; they feel the object sources of the skin deformations themselves. By analogy, in hearing, and specifically in speech perception, listeners should not perceive the acoustic signal, but rather its cause in the environment. In speech, the immediate cause of the signal is the vocal tract activity of the talker.

In either theory, the motor theory or the direct-realist theory, the McGurk effect, and audio-visual influences on speech perception more generally, arise because the optical and acoustic information is convincingly about the same speech event, and speech events, not structured media, are perceptual objects.

In the following experiments, we attempt to distinguish the two general accounts we have offered for the McGurk effect. In its global character, the effect is consistent with both accounts. Perceivers frequently both see and hear a speaker, and so they have ample opportunity to develop memory representations including both optical and acoustic cues. By the same token, outside the laboratory, visible talking and audible talking emanating from the same location in space are the same event of talking; if the objects of perception are environmental, as specified by information in media, then it is expected that information in different media that are joint consequences of the same event (or, in the McGurk procedure, of ostensibly the same event) serve jointly to specify the event to the perceiver.

In the following experiments, we have attempted to distinguish the two accounts by looking for cross-modal influences on speech perception from two new sources, each of which captures one, but not the other, distinctive aspect of the McGurk paradigm that might account for the cross-modal influences there.

One situation was meant, on the one hand, to capture the association in experience, and hence in memory, of an acoustically specified utterance with a specification in another modality. On the other hand, it was meant to exclude association via conjoint lawful specification of a common environmental event. To achieve this, we paired spoken syllables with matched or mismatched orthographic representations of syllables (cf. Massaro, Cohen, & Thompson, 1988). Our college-student subjects have been readers of an alphabetic writing system for over a decade, and they experience redundant pairings of sight and sound whenever they read aloud or else see a text that someone else is reading aloud. Although listeners may be less experienced with sound-spelling pairings than with pairings of the sound and sight of a speaker, their experience with the former pairings is sufficient that a spoken monosyllabic word can activate its spelling for a listener. For example, detection of an auditorily presented word that rhymes with a cue word also presented auditorily is faster if the two words are spelled similarly (e.g., "pie"-"tie") than if they are spelled differently (e.g., "rye"-"tie") (Seidenberg & Tanenhaus, 1979; see also Tanenhaus, Flanigan, & Seidenberg, 1980). (In an experiment similar to the orthographic condition of our Experiment 1, Massaro et al. (1988) reported weak cross-modal influences of a written on a spoken syllable. Our Experiments 1 and 2 explore the conditions under which this occurs and compare the effect to another possible source of cross-modal influences on speech perception.)

An important characteristic of the acoustic-orthographic pairings for our purposes is that their association is by convention, rather than by lawful causation (cf. Campbell, 1989). That is, while optical and acoustic correlates of a speaker talking are associated in the world because they are lawful consequences of the same event of talking, graphemes and their pronunciations are associated in the world by societal convention.

The second experimental situation that we devised was meant, insofar as possible, to be complementary to the first. That is, we established a cross-modal pairing that is unfamiliar to subjects, but that, outside the laboratory, is a lawful pairing, because the same environmental event gives rise to structure in the two different media. In this situation, we paired acoustically specified syllables with matched and mismatched manually felt, mouthed syllables. Our guess, confirmed by our subjects, was that they did not recollect experiences in which they had handled someone's face while he or she was talking. Although some of them may have had such experiences, as infants or young children perhaps, they must have had considerably less experience of that sort than they have had either seeing and hearing spoken utterances or seeing and hearing text being read.

We know that some phonetic information can be obtained by feeling the face and neck of a speaker. Helen Keller learned to speak English, French and German by obtaining phonetic information haptically (e.g., Keller, 1903). Other deaf-blind individuals have learned some speech and have learned to understand spoken language remarkably well using the Tadoma method, in which they learn to detect phonetic properties of speech haptically (e.g., Chomsky, 1986; Schultz, Norton, Conway-Fithian, & Reed, 1984). We do not know whether inexperienced perceivers can recover phonetic information by touch, and it is important for our test that our subjects not be trained. Accordingly, we selected a fairly distinct articulatory difference for them to detect, the difference between /ba/ and /ga/.

Our expectations were as follows. If the operative factor in the McGurk effect is the association in memory of cross-modal cues available during events in which speech occurs, then an influence of written syllables on heard syllables should occur, but an effect of felt on heard syllables should be weak or absent. If the operative factor is, instead, the (apparently) common causal source in the environment of the acoustic and optical structure, then felt syllables will affect what listeners report hearing, while orthographic representations of spoken syllables will not. Alternatively, of course, both (or neither) factor may be important.

EXPERIMENT 1

Our first experiment tested these predictions by looking for cross-modal haptic and orthographic influences on identifications of heard syllables and for reverse effects of heard syllables on reports of felt and read syllables. In this experiment and the next, we informed subjects that cross-modal syllables were paired independently, and we requested that, therefore, they make their judgments of heard syllables based only on what they heard; we did so in an effort to restrict cross-modal effects to influences on perception rather than on judgment.

Methods

Subjects. Subjects were 23 undergraduate students at Dartmouth College. All were native speakers of English who reported normal hearing and normal or corrected vision. None of the participants was experienced in the Tadoma method and none had special training in lipreading. Of the 23 participants, one student's data were eliminated from the analyses when she reported in debriefing that she had not believed our (accurate) statement that felt and heard syllables had been paired independently. Data from 10 additional subjects were eliminated from most analyses because their identifications of felt syllables did not significantly exceed chance (60.6% correct; z = 1.64).
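The 60.6% cutoff is consistent with a one-tailed normal approximation to the binomial over the 60 cross-modal trials; the following sketch, under that assumption (the computation is not spelled out in the text), reproduces it.

    from math import sqrt

    # Sketch: reconstruct the above-chance criterion for felt-syllable
    # identification, assuming a one-tailed normal approximation to the
    # binomial over the 60 Tadoma trials (an assumption on our part).
    n_trials = 60        # cross-modal trials per subject
    p_chance = 0.5       # two response alternatives, "ba" vs. "ga"
    z_criterion = 1.64   # one-tailed criterion, alpha of about .05

    se = sqrt(p_chance * (1 - p_chance) / n_trials)
    cutoff = p_chance + z_criterion * se
    print(round(100 * cutoff, 1))   # 60.6 (percent correct required)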

Stimulus materials. Acoustic stimuli were three-formant synthetic consonant-vowel (CV) syllables produced by the serial resonance synthesizer at Haskins Laboratories. There were 10 syllables in all, ranging across an acoustic continuum of F2 transitions from a rising transition appropriate for /ba/ to a falling transition appropriate for /ga/. F3 was fixed and rising across the continuum; accordingly, there were no intermediate /da/ syllables. Steady-state values for the three formants were 800, 1190 and 2400 Hz (with bandwidths of 50, 100 and 200 Hz, respectively). Starting frequencies of F1 and F3 were 500 and 2100 Hz. Starting frequencies of F2 ranged from 760 Hz at the /b/ end of the continuum to 1660 Hz at the /g/ end in 100 Hz steps. Transitions were 50 ms in duration with a following 150 ms steady state. Fundamental frequency increased by 10 Hz over the first 30 ms of each syllable to a steady-state value of 120 Hz and declined by 10 Hz over the last 40 ms. The amplitude contour had an analogous shape and time course.
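For readers who want the continuum at a glance, here is a minimal sketch (in Python) of the ten-step parameter set implied by the description above; the field names and layout are ours, not the format used by the Haskins synthesizer.

    # Illustrative parameter table for the /ba/-/ga/ continuum; values are
    # taken from the text, the representation itself is hypothetical.
    continuum = [
        {
            "step": i + 1,
            "f1_onset_hz": 500,  "f1_steady_hz": 800,
            "f2_onset_hz": 760 + 100 * i, "f2_steady_hz": 1190,  # 760-1660 Hz
            "f3_onset_hz": 2100, "f3_steady_hz": 2400,           # rising for every step
            "transition_ms": 50, "steady_state_ms": 150,
            "f0_steady_hz": 120,  # rises 10 Hz over the first 30 ms, falls 10 Hz over the last 40 ms
        }
        for i in range(10)
    ]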

The ten continuum members were stored (filtered at 10 kHz and sampled at 20 kHz) in a small computer (New England Digital Company) programmed to run the various conditions of the experiment. They were presented to listeners over a loudspeaker situated to the right of the CRT screen.

In the orthographic condition, on each trial, "BA" or "GA" was printed in upper case on the CRT screen simultaneously with the presentation of the acoustic syllable; the printed syllable remained on the screen until subjects pressed a key to initiate the next trial.

In the Tadoma condition, mouthed syllables were /ba/ and /ga/ produced by a model (the first author). The model faced a CRT screen that specified the syllable to be mouthed on each trial and provided a countdown permitting the mouthed syllables to be produced in synchrony or near synchrony with the acoustically presented syllable.

A single, 60-item test order of the acoustic continuum members was used for all conditions of the experiment. Six tokens of each continuum member were presented in random order, but with the constraint that each synthetic syllable occur twice in each third of the test order. In the orthographic condition, a second 60-item test order determined which orthographic syllable would be paired with each acoustic syllable. The same test order dictated the order of syllables to be mouthed in the Tadoma condition. This sequence was also random, but now with the constraint that each orthographic (felt) syllable be paired with each synthetic syllable once in each third of the test order.
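One way to generate test orders with these blocking constraints is sketched below; the paper does not say how its orders were constructed, so the function and its names are illustrative only.

    import random

    def blocked_order(items, repeats_per_block, n_blocks):
        # Randomize trials so each item appears repeats_per_block times in every block.
        order = []
        for _ in range(n_blocks):
            block = items * repeats_per_block
            random.shuffle(block)
            order.extend(block)
        return order

    # 60-item acoustic order: each of the 10 continuum members twice per third.
    acoustic_order = blocked_order(list(range(1, 11)), repeats_per_block=2, n_blocks=3)

    # Cross-modal pairing: each continuum member paired once with "BA" and once
    # with "GA" in each third of the test.
    pairs = [(step, syl) for step in range(1, 11) for syl in ("BA", "GA")]
    cross_modal_order = blocked_order(pairs, repeats_per_block=1, n_blocks=3)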

Procedure. Each subject participated in three tests: synthetic syllables alone (auditory condition), synthetic syllables paired with printed syllables (orthographic) and synthetic syllables paired with felt syllables (Tadoma). The order of the conditions was counterbalanced, with two subjects (of the twelve who exceeded chance in identifying felt syllables) experiencing each order.

Subjects were run individually. They first heard the endpoint syllables from the synthetic-speech continuum. The endpoints were identified for them and played several times. Subjects were told that the speech was produced by a computer and that they would be identifying syllables like the ones they had just heard in subsequent tests.

In the auditory test, subjects were seated in front of the CRT screen. Printed on the screen was the message "PRESS RETURN TO PROCEED." To initiate each trial, subjects pressed the return key. On each trial, one synthetic syllable was presented over the loudspeaker. Subjects made their responses by circling printed "B" or "G" on an answer sheet; they were instructed to guess if necessary, and then to continue the test at their own pace by pressing the return key for each trial.

In the orthographic condition, the test was similar except that when the return key was pressed, a printed syllable appeared on the screen with its onset simultaneous with the onset of the synthetic syllable. The printed syllable remained on the screen until the subject pressed the return key for the next trial.¹ Subjects were instructed to watch the screen as the printed syllable was displayed. They then made two responses, first circling either "B" or "G" under the heading "heard," indicating which syllable they had heard, and then circling "B" or "G" under the heading "saw," indicating which syllable they had seen. Subjects were told explicitly that the acoustic and spelled syllables were independently paired, so that they should always base their "heard" judgment on what they had heard independently of what they had seen, and vice versa for the "saw" judgment. As in the auditory condition, they were instructed to guess if they were unsure of the syllable they had heard or seen on any trial.

In the Tadoma condition, the model stood facing the CRT screen with the loudspeaker directly in front of her at about waist level. She sequenced the trials by pressing the return key to initiate each one. In advance of presentation of the acoustic syllable, the computer printed a syllable on the screen that the model was to mouth. Then it presented a countdown consisting of a sequence of two asterisks and then a pair of exclamation points (i.e., * ... * ... !!) at 1000 ms intervals. The exclamation points were presented simultaneously with the onset of the acoustic syllable. Between trials, the model kept her lips parted and jaw slightly lowered. With practice, she learned to time her closing for the mouthed consonant so that the opening coincided phenomenally with onset of the acoustic signal.

Subjects received no instructions at all on how to distinguish felt "ba" from "ga." Each subject stood back to the CRT screen, approximately a step farther from the CRT screen than the model. (This was to prevent the subject from looking at the model's face; in any case, relevant parts of her face were covered by the subject's hand.) The subject stood with his or her right hand in a disposable glove placed over the lips of the model. After some piloting, we found that subjects were most successful if we placed the forefinger on the upper lip and the next finger on the lower lip of the model. This is not the procedure used by Tadoma perceivers (who generally place a thumb vertically against the lips); however, it allowed subjects to distinguish open from closed lips. After presentation of paired felt and heard syllables, subjects indicated to a second experimenter which syllable, "ba" or "ga," they had heard and then which syllable they had felt, in each case guessing if necessary. They made their responses by pointing to printed syllables on an 8.5 x 11 inch sheet of paper held on a clipboard by a second experimenter (the second author). On the sheet of paper, the printed syllables "BA" and "GA" appeared twice, on the left under the heading "heard" and on the right under the heading "felt." The second experimenter then marked an answer sheet analogous to those used in the orthographic condition (that is, with response columns "heard" and "felt"). As in the orthographic condition, subjects were told explicitly that synthetic and mouthed syllables were paired independently and that they should, therefore, make their "heard" judgment based only on what they had heard and their "felt" judgment based only on what they had felt.

Because the Tadoma procedure was quite taxing, not only for the model but also for subjects (whose right arms were extended to the side at the model's mouth level), subjects were allowed to stop and rest at any point during the procedure. Except for rest periods, trials were sequenced by the model pressing the return key immediately after the second experimenter indicated that she had recorded the subject's responses.

The model was, of course, aware of the syllable to be mouthed, but not of the synthetic syllable continuum member to be presented on each trial. The experimenter who recorded the subject's responses was blind to the syllable being mouthed, but able to hear the synthetic syllable. Accordingly, neither experimenter had information that could bias their performance of their respective roles in the experiment with respect to predictions concerning effects of felt on heard speech.

Results

Results in the auditory condition for the 12 subjects whose performance identifying "felt" syllables was significantly better than chance are presented in Figure 1. Performance averaged 90 percent "b" judgments for the first two continuum members and 12 percent "b" judgments over the last two. In the cross-modal tasks, performance identifying the orthographic syllables was, not surprisingly, near 100%, while performance judging felt syllables averaged 78.6%.

Effects of the orthographic and felt syllables on "heard" judgments are presented in the top and bottom panels of Figure 2. The top panel shows a small effect of the orthographic syllable on percent "b" judgments, while the bottom panel shows a larger, more consistent effect of the felt syllables. In an analysis of variance with repeated measures factors Continuum number, Cross-modal syllable (ba/ga) and Condition (Orthographic, Tadoma), the effect of Continuum number on percent "b" judgments was, of course, highly significant (F(9,99) = 67.52, p < .001), accounting for 56.2% of the variance in the analysis. The effect of Condition was not significant, but the effect of Cross-modal syllable reached significance, as did its interaction with Condition (F(1,11) = 17.22, p = .0017; F(1,11) = 4.85, p = .048, respectively).² To explore the basis for the interaction, we performed tests separately on the data from each modality, setting alpha now at .025. The 17.5% difference in heard "b" responses accompanied by felt "ba" versus "ga" was highly significant (F(1,11) = 13.70, p < .001), while the analogous 5.8% effect in the orthographic condition was marginal (F(1,11) = 4.52, p = .055). In the original analysis, no other interactions reached significance. The main effect of Cross-modal syllable and its interaction with Condition remained significant with nearly identical F and p values when the analyses were redone on percent "b" responses averaged across the ten continuum members.
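For reference, an analysis of this general form (a fully within-subjects design with factors Condition, Cross-modal syllable and Continuum number) could be run today roughly as sketched below; the original analysis software is not specified, and the data frame here is random placeholder input meant only to show the required layout, not the experimental results.

    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Placeholder data (random): one percent-"b" score per subject per
    # Condition x Syllable x Continuum cell.
    rng = np.random.default_rng(0)
    rows = [
        {"subject": s, "condition": cond, "syllable": syl,
         "continuum": step, "pct_b": rng.uniform(0, 100)}
        for s in range(1, 13)
        for cond in ("orthographic", "tadoma")
        for syl in ("ba", "ga")
        for step in range(1, 11)
    ]
    df = pd.DataFrame(rows)

    aov = AnovaRM(df, depvar="pct_b", subject="subject",
                  within=["condition", "syllable", "continuum"]).fit()
    print(aov)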

In the next analysis, we looked only at those "heard" trials on which subjects had correctly identified the "felt" syllables. Because this excluded many trials, we examined performance averaged across the continuum. On average, listeners showed a 30.4% difference in favor of heard "b" judgments for those trials accompanied by felt "ba" rather than felt "ga." This difference is nearly double that found by looking at all heard trials, and it is highly significant (t(11) = 4.25, p = .0014). We did not perform an analogous test on the orthographic data, because performance identifying printed syllables was near 100%.

A further analysis was performed on data from the Tadoma condition, now with data from the auditory condition included. If felt "ba" increases heard "b" judgments, and felt "ga" decreases them, then the means for the auditory condition should fall between those for the felt "ba" and "ga" trials.

Figure 1. Percent "b" judgments to acoustic continuum members in the auditory condition, in which acoustic syllables were unaccompanied by spelled or felt syllables.


Figure 2. Percent "b" judgments to acoustic continuum members in cross-modal conditions of Experiment 1. Top panel: orthographic condition; bottom panel: Tadoma condition.


On average, they do, with percent "b" judgments averaging 62.2% overall on felt "ba" trials, 59.6% on acoustic-alone trials and 44.7% on felt "ga" trials. However, while the difference between the acoustic and felt "ga" trials is large and consistent, that between the acoustic and felt "ba" trials is not. In an analysis of variance with factors Continuum number and Condition (felt "ba," auditory, felt "ga"), both main effects were significant. However, in analyses performed on each comparison separately, only the auditory-"ga" difference was significant (F(1,11) = 22.66, p = .0006); the auditory-"ba" difference did not approach significance (F < 1). However, the auditory-"ba" variable did interact with Continuum number (F(9,99) = 2.09, p = .037), with the predicted direction of effect occurring just on numbers 1, 4 and 7-10. The failure of felt "ba" to differ from the auditory condition may signify that only felt "ga" gave rise to a McGurk-like effect. An alternative interpretation that we address in Experiment 3 is that blocking the acoustic-alone trials from the Tadoma trials led to a criterion difference in classification judgments across the conditions.

In another analysis on heard syllables in the Tadoma condition, we examined performance on the first block of 20 Tadoma trials to see whether a cross-modal influence was present from the very beginning, when subjects were least experienced with the task. This test is of particular interest in the Tadoma condition, because subjects had essentially no prior experience handling the faces of talkers. We had designed the test order so that every continuum member occurred once with felt "ba" and once with felt "ga" in the first block of twenty trials. This analysis, performed on "b" responses averaged across the continuum, yielded a significant difference between felt "ba" and "ga" trials (t(11) = 2.84, p = .016). The difference was 17.5% in favor of "b" responses, a difference, coincidentally, of exactly the same magnitude as the difference averaged over all 60 trials.

We looked for a reverse effect of heard syllables on felt judgments. (We did not look for an analogous effect on orthographic judgments, because performance there was at ceiling.) Massaro (1987) obtained an analogous influence of heard syllables on judgments of visually perceived mouthed syllables. In an analysis of variance with factors Continuum number and Felt syllable (felt "ba" or "ga"), both main effects and the interaction were significant (Continuum: F(9,99) = 5.61, p < .001; Felt syllable: F(1,11) = 61.41, p < .001; interaction: F(9,99) = 2.08, p = .038). The main effect of Continuum was significant because there were more felt "ba" judgments on trials associated with acoustic syllables at the /b/ end of the acoustic continuum than on trials associated with acoustic syllables at the /g/ end. The main effect of Felt syllable was significant because there were more felt "ba" judgments when the felt syllable was "ba" (85.9%) than when it was "ga" (29%). To explore the basis for the interaction, we performed separate analyses on the felt "ba" and felt "ga" trials. In both instances, the effect of Continuum was significant; we used trend tests to look for a linear effect of acoustic continuum number on percent "ba" judgments. Both tests were significant (felt "ba": F(1,99) = 8.54, p = .0044; felt "ga": F(1,99) = 29.64, p < .001); in each case, felt "ba" judgments decreased as the acoustic syllable shifted from the /b/ to the /g/ end of the continuum. They decreased from 95% "b" judgments to 80% on felt "ba" trials and from 39% to approximately 10% on felt "ga" trials.

Discussion

The most important outcome of this experiment is that a strong cross-modal effect on judged spoken syllables occurred in one cross-modal condition, while at most a marginal effect occurred in the other. In particular, a highly significant effect occurred when mouthed syllables were felt simultaneously with an acoustic presentation of similar syllables, but a marginal or nonsignificant effect occurred when syllables were printed. According to the logic presented in the introduction, we interpret this difference as evidence favoring accounts of speech perception in which perceptual objects are the phonetically significant vocal tract actions that cause structure in light, air and on the skin and joints of the hand. By the same token, we have shown that mere association in experience between sources of information for a syllable at best gives rise to a weak cross-modal influence using this procedure. Shortly we consider an alternative interpretation of these latter findings that Experiment 2 is designed to address. First, however, we consider some other outcomes of Experiment 1.

One interesting outcome is that the cross-modal effect of felt on heard syllables is present in inexperienced subjects in the very first block of trials in which they participate. This suggests that the effect arises in the absence or near absence of experience in which the acoustic and haptic information is paired. Accordingly, we conclude that joint specification of an environmental event does not require specific learning to be effective in perception. We discuss this conclusion further in the General Discussion.

A second interesting finding of the experiment is that the cross-modal effects in the Tadoma condition worked in both directions. Not only did felt syllables affect judgments of the syllable heard, but the acoustic syllable affected judgments of the syllable felt. This suggests considerable integration of the information from the two modalities.

There is an alternative interpretation of our findings in the orthographic condition. Our expectations had been that the Tadoma condition would lead to cross-modal effects, while the orthographic condition would not. Accordingly, we designed the orthographic condition in a way that we thought would optimize its chances of working. That is, we presented the orthographic syllable for a long period of time, guaranteeing that subjects would see the syllable and hence maximizing the number of opportunities for a cross-modal effect to occur. However, the effect of the long-duration presentation may have been different than we expected. Although we asked subjects to look at the screen as they pressed the return key (so that they would see the printed syllable simultaneously with hearing the acoustic syllable), they need not have followed directions, because the syllable remained on the screen until the next trial was initiated. Further, the task of feeling a mouthed syllable is difficult and attention-demanding in a way that the task of looking at a clearly presented printed syllable is not. One consequence may be that the Tadoma task took attention away from the listening task, leaving the acoustically signaled syllables less clearly perceived and perhaps more vulnerable to influence of information from another modality. To address these possibilities, we designed a second orthographic condition in which printed syllables were presented briefly and masked.

EXPERIMENT 2

Experiment 2 was designed to assess the effect of attention on the cross-modal influence of printed syllables on acoustic syllables. We hoped to use masking to force subjects to look at the printed syllable at the same time they were listening to the acoustic syllable and to drive performance reporting orthographic syllables down to a level comparable with the nonrandom subjects' ability to report the felt syllables in Experiment 1 (78.6% correct).

Methods

Subjects. Subjects were 12 undergraduate students at Dartmouth College who were native speakers of English with normal hearing and normal or corrected vision.

Stimulus materials. Auditory stimuli were those described in Experiment 1. The visual stimuli were also identical to the visual stimuli in Experiment 1, except that they were masked by a row of number signs ("####"). On each trial, a row of four number signs appeared just above the location on the screen where the syllable would be printed. Simultaneous with presentation of the synthetic-speech syllable, either "BA" or "GA" was printed on the screen below the number signs; the printed syllable was covered over after 67 ms by another row of number signs.³ The mask remained on the screen until the subject hit the key for the next trial.

The 60-trial test orders of continuum members and orthographic syllables used in Experiment 1 were also used here to sequence stimulus presentations.

Procedure. Each subject was tested individually. As in Experiment 1, the endpoints of the continuum were played and identified before the main test to allow subjects to become familiar with the synthetic-speech syllables. They were told that they could pace themselves through the experiment by hitting the return key, and they were instructed not to leave any blanks on the answer sheet, taking a guess if necessary. In addition, as in Experiment 1, subjects were told that acoustic signals and masked visual signals were independently paired on each trial, and, therefore, decisions about them should be made independently.

Results and Discussion

We succeeded in lowering performance on the printed syllables to the level of the nonrandom subjects in the Tadoma condition of Experiment 1. Performance judging orthographic syllables averaged 78.3%, very nearly the same level of performance judging felt syllables found in the Tadoma condition.

The major results of the experiment are depicted in Figure 3. The figure shows no effect of the orthographic syllable on identifications of heard syllables. An analysis of variance on the data in Figure 3 revealed only a significant effect of Continuum number (F(9,99) = 39.87, p < .001). The effect of orthographic syllable was nonsignificant, with a 4.4% numerical difference between means in the unpredicted direction; the interaction also failed to reach significance.


Figure 3. Percent "b" judgments to acoustic continuum members in the masked orthographic condition of Experiment 2.

In an analysis of variance including the Tadoma condition of Experiment 1 and the results of the present experiment, the interaction between Condition and Cross-modal syllable was highly significant (F(1,22) = 17.18, p = .0005).

An analysis of variance on the effect of the acoustic signal on visual judgments yielded two significant effects. The effect of orthographic BA or GA was, of course, highly significant (F(1,11) = 319.02, p < .001), accounting for 60.6% of the variance. In addition, there was a marginal effect of acoustic continuum number (F(9,99) = 1.98, p = .049) that accounted for just 1.4% of the variance. The interaction between continuum and orthographic BA or GA did not reach significance. We performed a trend test on the effect of Continuum number to determine whether decreases in identifications of an orthographic syllable as BA were associated with decreasingly /ba/-like acoustic signals. The analysis, which tested for a linear decrease in "ba" identifications across the continuum, did not approach significance (F < 1).

We were successful in using masking to decrease identifiability of the orthographic syllables to a level comparable to identifiability of felt syllables among nonrandom subjects in the Tadoma condition. We hoped that this manipulation would increase attention to the visual stimulus while the acoustic signal was being presented and would attract attention away from the acoustically presented syllables. If these attentional features had been characteristic of the Tadoma condition of Experiment 1, but not of the orthographic condition of that experiment, and if those differences had been the operative ones in the different outcomes of those conditions in Experiment 1, then a cross-modal effect should have been strengthened in the present experiment. Instead, the marginal effect in Experiment 1 disappeared completely in Experiment 2; accordingly, we revert to our original interpretation of the difference in outcome. McGurk-like effects occur when information from the two modalities consists of conjoint, lawful consequences of the same environmental event. They do not occur based only on association in experience.

EXPERIMENT 3

We performed a final experiment, in two phases, to address a variety of methodological concerns raised by reviewers about the haptic condition of Experiment 1. Two concerns were consequences of our having used a loudspeaker rather than headphones to present the acoustic stimuli. Possibly mouthed syllables were whispered, and the influence on heard syllables in that condition was auditory-auditory, not haptic-auditory. In addition, since the model could hear the spoken syllables, perhaps that biased her mouthing actions. While we discount both concerns,⁴ it was easy to allay them by eliminating the loudspeaker and substituting headphones. Of course, this is likely to reduce the cross-modal influencing effect, as do other manipulations that decrease the compellingness of information that the cross-modal signals derive from the same physical event (see, e.g., Easton & Basala, 1982). However, we considered our original effect large enough to survive some reduction due to spatial dislocation of voice and mouth.

A third concern was that we had provided no measure of synchrony between mouthed and audible syllables; we provided a measure, albeit a phenomenal one, in Experiment 3. A fourth concern was with the dropout rate among our subjects. We believed that the poor performance levels on haptic judgments on the part of some subjects in Experiment 1 were due to discomfort with the procedure and the absence of rewards for high performance. Accordingly, in Experiment 3, we instituted rewards.

The most serious concern with the findings, however, was that they are consistent with two interpretations of the cross-modal influence of felt mouthed on heard syllables. One is that the influence reflects a true integration, within a trial, of information from two modalities ostensibly about the same physical event. The other is that information from the two modalities remains unintegrated and that, over trials, subjects sometimes select either the acoustic syllable or the haptic syllable and give it as their response both to the heard and to the felt syllable. We had attempted to discourage such a response strategy by informing our subjects that we had independently and randomly paired mouthed and spoken syllables and that, in consequence, the identification of a syllable in one modality provided no information about the identification of the syllable in the other modality. Moreover, the selection strategy was equally available to subjects in Experiment 2, but it was not adopted there. We interpreted this as evidence that subjects understood, believed, and could follow those instructions. Accordingly, we ascribed the cross-modal influences that occurred in the haptic, but not the masked visual, condition to integration taking place in one set of conditions and not in the other. However, in Experiment 3, we attempted to provide more direct evidence for integration rather than selection.

Massaro (1987) provides two kinds of information for distinguishing integration and selection in his tests of the effects of visible speaking on identification of spoken syllables. One kind of information is provided by superadditivity of contributions from the modalities to response identifications. In one of Massaro's experiments, on average, subjects correctly identified visible /ba/ presented with no accompanying acoustic signal on 75% of trials. Further, they identified, for example, the most /ba/-like acoustic syllable presented unimodally as "ba" on just over 80% of trials. However, the same acoustic continuum member accompanied by visible /ba/ was identified as "ba" nearly 100% of the time. When that pattern is present to a sufficient degree in individual-subject performances, it rules out one version of a selection strategy, because selection by that strategy will yield an averaging of identification percentages weighted by the relative frequencies with which each modality is selected. The strategy is available to subjects who can process just one modality of information on each trial. In that case, they have just one response to offer, and they may give that response as their identification of syllables presented in both modalities. That strategy will mimic evidence of cross-modal integration of information except that the probability of identifying a given acoustic syllable, say, as "ba" on a bimodal trial cannot exceed a weighted average of the probabilities of identifying each unimodal syllable as "ba." In the example above, performance identifying the first acoustic continuum member as spoken "ba" could not exceed 80% on bimodal trials, given that the acoustic syllable was identified as "ba" 80% of the time and the visible syllable was identified as "ba" less than 80% of the time on unimodal trials.
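The constraint this places on a pure selection strategy can be stated as a simple weighted average, sketched below with the illustrative numbers from the text (the selection weight itself is a free parameter).

    # Selection bound: if only one modality is processed and reported on each
    # trial, the bimodal "ba" rate is a weighted average of the unimodal rates
    # and cannot exceed the larger of them.
    p_auditory = 0.80   # unimodal acoustic syllable identified as "ba"
    p_visual = 0.75     # unimodal visible /ba/ identified as "ba"

    def selection_prediction(w):
        # Bimodal "ba" rate if the auditory response is selected with probability w.
        return w * p_auditory + (1 - w) * p_visual

    # Stays within [0.75, 0.80] for any w in [0, 1]; an observed rate near 1.0
    # (Massaro's superadditive pattern) is therefore inconsistent with it.
    print([round(selection_prediction(w), 3) for w in (0.0, 0.25, 0.5, 0.75, 1.0)])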

We looked for evidence of superadditivity in our data, but we point out here that it does not provide strong evidence against selection.⁵ It is implausible to assume that subjects can only process information from one modality at a time. More likely, both modalities provide perceptual information on each trial. A possible selection strategy, then, is, for example, in identifying the heard syllable, to select a response based on the auditory information (so, in the example, identify the syllable as "ba" with probability .8) unless that information is noisy; in that case, use the information from the other modality (in the example, choose "ba" with probability .75). Now the bimodal probability becomes .8 + (1 - .8) * .75 = .95, a value higher than either unimodal value.

A second kind of information for integration rather than selection is provided by evidence that the syllables from the modalities blend in some way. For example, given visible /da/ and auditory /ma/, an identification of the spoken syllable as "na" blends place information from the visible syllable and manner information from the acoustic syllable. In Experiment 3, we also looked for evidence of blending.

Methods

Subjects. Subjects were ten undergraduate students at Dartmouth College who participated for course credit. Two subjects of the ten were also paid for high performance identifying haptic and continuum end-point acoustic syllables. All subjects reported normal hearing. Data from three unpaid subjects were eliminated because their performance identifying haptic syllables was at chance.

Stimulus materials. Acoustic stimuli were those used in Experiments 1 and 2. We created four test orders of 96 trials each. Of the 96 trials, 60 were cross-modal audio-haptic syllables as in Experiment 1. In thirty trials, the ten acoustic continuum members were presented by themselves three times each. Six trials were unimodal haptic trials in which the mouthed syllables BA and GA occurred three times each. Unimodal trials are needed to test for superadditivity of bimodal contributions to syllable identification. Trials of the various types were randomized in each test order.

Procedure. Subjects participated in four sessions, in each of which one of the test orders was used. The order of the different test sequences was varied across subjects. Generally, instructions were the same as in Experiment 1 except, of course, that we told subjects that some trials were not bimodal and that they should just report one syllable on those trials. An additional change in procedure was that subjects now wore headphones over which acoustic syllables were played; the acoustic syllables were, thereby, inaudible to the experimenters. A final change in procedure, meant to improve performance identifying haptic syllables, is that we invited subjects to use whatever hand placement on the face made it easiest for them to identify mouthed syllables.

For the first two subjects we ran, we offered a system of payments for high performance on haptic identifications and on identifications of acoustic continuum endpoints. In particular, subjects could receive $0.25 for each percentage point above 75% for haptic identifications and another $0.25 for each percentage point above 75% for identification of acoustic endpoints. Subjects could, therefore, earn a maximum of $50.00 in all across the four sessions.

We also asked these same subjects to give us a final judgment on each trial in addition to identification of heard and/or felt syllables. We asked them to tell us whether the syllables in the two modalities were synchronous or not. In particular, we told them that they should report an "E" if the mouthed syllable led the acoustic syllable, an "S" if it was synchronous and an "L" if the mouthed syllable lagged the acoustic syllable.

Both of these latter procedures were abandoned after two subjects were run. The system of payments was abandoned because performance on haptic unimodal syllables was near perfect, making it impossible to test for integration by looking for superadditivity. (One subject earned $43.00 of the maximum possible of $50.00; the other earned $39.50. On unimodal haptic trials, performance averaged 92% correct, too high to look for superadditivity of cross-modal influences on syllable identification.) The assessment of simultaneity was abandoned because, although we did not bias subjects to expect simultaneity, they reported simultaneity virtually all of the time (on 97% of trials for one subject and 98% for the other).

Because we feared that superadditivity would remain difficult to test for, and since we have learned that it does not provide a strong test of integration anyway, we made a final change in procedure that we hoped would enable us to test for integration by testing for blend responses. For the remaining subjects, we opened the response inventory in the following way. As we had done in Experiments 1 and 2 and in running the first two subjects of Experiment 3, we played the acoustic-continuum endpoints to subjects before they began the experimental test. In addition, for the remaining subjects, we told them that there were ten syllables in all and that the other eight ranged between the /ba/ and /ga/ syllables they had just heard. We played them continuum member 5 as an example of an ambiguous syllable. We told subjects that we did not know how listeners would identify those intermediate sounds; they might sound like ambiguous /ba/s and /ga/s or they might sound like other consonants to them, possibly /na/, /va/ or /da/. We asked them always to report the sound they heard, whether it was /ba/, /ga/ or some other consonant sound. We also gave them to understand that, haptically, while they should feel some clear instances of mouthed /ba/s and /ga/s, sometimes they might feel other consonant-initial syllables. On judgments of felt syllables, they should always report the syllable they experienced, whether it was /ba/, /ga/ or some other consonant-initial syllable.

Because, in fact, our procedural changes made little difference in subjects' performances, for most analyses we have pooled the findings on all subjects who completed the four sessions. (Three of the eight unpaid subjects were dropped after one or two sessions for chance performance on haptic trials. Thus, after abandoning our system of payments, performance that had been at ceiling on unimodal haptic trials and very high on bimodal trials reverted to a level just 3% higher than its level in Experiment 1. We believe that subjects can do the haptic task if they are motivated to attend.)

Results

Figure 4 presents the judgments of heard syllables on the auditory-only (top) and cross-modal (bottom) trials of Experiment 3. In an analysis of variance on cross-modal trials, with factors Continuum number and Haptic syllable, both main effects were significant (continuum: F(9,54) = 74.60, p < .0001; haptic syllable: F(1,6) = 16.29, p = .007). The interaction did not approach significance (F < 1). While the effect of haptic syllable was numerically smaller than in Experiment 1, perhaps due to the spatial dislocation of cross-modal syllables, it is present on nearly every syllable along the continuum and is highly significant.

Another reason for the reduced cross-modal effect, as compared to its magnitude in Experiment 1, might be that subjects learned over sessions to divide their attention. To look for effects of practice, a further test was performed, now on the percent "ba" responses averaged across the continuum and now including auditory-alone trials, with session as an independent variable. In the analysis, there was an effect of haptic syllable (BA, GA, none; F(2,12) = 4.81, p = .029) and an effect of block (F(3,18) = 3.20, p = .048); however, the interaction did not approach significance (F < 1). As in Experiment 1, "ba" responses to auditory-alone syllables (49.5%) fell between responses to syllables accompanied by mouthed BA (54%) and responses to syllables accompanied by mouthed GA (49%). Now, however, they fell closest to (and very close to) performance on mouthed GA trials, rather than mouthed BA trials as they had in Experiment 1. The effect of sessions occurred because the percentage of "ba" responses declined monotonically from session 1 (54.8%) to session 4 (48.7%). However, the size of the cross-modal effect did not change across sessions, and the numerical change was in the direction of an increased, not a decreased, effect with practice.

We looked also at the effect of acoustic syllables on haptic judgments. We compared the percentage of "ba" judgments to mouthed BA and GA when they were accompanied by acoustic syllables from the /ba/-end (items 1-5) or the /ga/-end (6-10) of the continuum. Table 1 gives the results. For both mouthed syllables, identification of the felt syllable as "ba" was more likely when the mouthed syllable was produced synchronously with a /ba/-like acoustic syllable than when it was produced with a more /ga/-like syllable. An analysis of variance was performed on the data in Table 1, arc-sine transformed to eliminate variance differences due to approaches to ceiling for mouthed BA at the /ba/-end of the continuum and to floor for mouthed GA produced at the /ga/-end of the continuum. The effect of haptic syllable was, of course, highly significant (F(1,6) = 45.69, p = .0007). The effect of acoustic syllable was also significant (F(1,6) = 6.42, p = .04), as was the interaction (F(1,6) = 9.72, p = .02). The interaction occurred because the effect of continuum was 28.7% for haptic GA but only 7.7% for haptic BA, where performance was very close to ceiling.
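The arc-sine transform referred to here is the standard angular transform for proportions, p' = arcsin(sqrt(p)). The minimal sketch below applies it, purely for illustration, to the cell percentages of Table 1 (the analysis itself was run on the subjects' data).

    import numpy as np

    # Angular (arc-sine square root) transform: p' = arcsin(sqrt(p)).
    # It stretches the scale near 0 and 1, counteracting the compressed
    # variance of proportions at floor and ceiling.
    table1_pct = np.array([[95.4, 87.6],   # haptic BA: /ba/-end, /ga/-end
                           [35.7,  7.0]])  # haptic GA: /ba/-end, /ga/-end
    transformed = np.arcsin(np.sqrt(table1_pct / 100.0))
    print(np.round(transformed, 3))        # values in radians, 0 to pi/2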

We turn now to tests of integration. Generally, the test for superadditivity was not applicable to subjects, because their performance identifying haptic syllables was at ceiling on unimodal trials. In particular, none of our seven subjects was less than 92% correct in identifying haptic BA as such on haptic-alone trials; five subjects made no errors. Obviously, we cannot look for superadditivity of effects of haptic and acoustic information using haptic- and acoustic-alone trials to predict performance on bimodal trials. We do not know why Massaro's subjects performed relatively poorly (75% accurate in his Figure 7, p. 65 of Massaro, 1987) identifying /ba/ in a video-only condition, so that evidence of superadditivity could be sought. By the same token, three of our seven subjects made no more than 8% errors identifying /ga/ on haptic-alone trials. We looked at performances of the remaining four subjects.
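The paper does not spell out a formula for the superadditivity test, so the sketch below is only one way such a check might be operationalized (our assumption, not the authors' criterion): bimodal accuracy must exceed what independent combination of the two unimodal accuracies already predicts. It also makes concrete why ceiling-level haptic performance leaves the test no room to succeed.

    def exceeds_independent_prediction(p_acoustic: float, p_haptic: float,
                                       p_bimodal: float) -> bool:
        # Independent-combination benchmark: P(correct from either source)
        # = pA + pH - pA * pH. "Superadditive" here means doing better still.
        predicted = p_acoustic + p_haptic - p_acoustic * p_haptic
        return p_bimodal > predicted

    print(exceeds_independent_prediction(0.60, 0.75, 0.95))  # True: room to exceed
    print(exceeds_independent_prediction(0.60, 1.00, 1.00))  # False: a haptic ceiling
    # already predicts perfect bimodal performance, so no outcome can exceed it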



[Figure 4 appears here: two panels plotting percent "ba" judgments (0 to 100) against continuum number (1 to 10); the cross-modal panel shows separate functions for felt "BA" and felt "GA".]

Figure 4. Percent "ba" judgments to acoustic continuum members on the auditory-alone trials (top) and cross-modal trials (bottom) of Experiment 3.



Table 1. Percent "ba" responses to haptic BA and GA on trials in which mouthed syllables were produced simultaneously with synthetic syllables from the /ba/-end (continuum numbers 1-5) and /ga/-end (6-10) of the continuum.

                 /ba/-end    /ga/-end
    Haptic BA      95.4        87.6
    Haptic GA      35.7         7.0

Of them, two showed some evidence of superadditivity. Across haptic and acoustic identifications on trials involving GA as a mouthed syllable, there are 20 opportunities to show superadditivity of haptic-alone and acoustic-alone contributions to bimodal identification of the mouthed or acoustic syllables. Two of the four subjects showed superadditivity on six of the 20 trials; of the remaining two subjects, one showed superadditivity on two trials and one on none. We do not know how to evaluate this outcome statistically. (In Massaro's grouped data (p. 65), four of nine continuum members show superadditivity relative to video-alone /ba/, while none do so relative to video-alone /da/, where performance is at ceiling; in his sample individual-subject data, two continuum members show superadditivity relative to video-alone /ba/ while none do relative to video-alone /da/, where, again, performance is at ceiling. Massaro evaluated superadditivity by testing his model, in which bimodal influences are integrative and hence superadditive, against a competing selection model. Since we reject Massaro's model as a source of cross-modal haptic influences on auditory speech perception and vice versa, we did not copy his procedure.)

A second test for integration involves tests for blends of information presented in the different perceptual modalities. To enable us to test for blends, we opened the response inventory for subjects after the first two subjects that we tested in Experiment 3. However, just one subject of the remaining five used a response other than "b" or "g" on more than a dozen occasions. That subject used "va" as a frequent response both for heard- and for felt-syllable judgments. Here we consider the patterning of his "va" responses.

Since /ba/ and /ga/ are articulatory extremes in English, almost any consonantal response other than "b" or "g" is intermediate between /ba/ and /ga/ and hence looks, on the surface, like a blend. To attempt to determine whether the subject who gave frequent "va" responses was giving true blends or, alternatively, was simply guessing, giving another consonant name, we looked for a systematic patterning in "va" responses. We reasoned that, if "va" responses are blends, then when BA is the mouthed syllable, "va"s should be given as an identification of either the heard or the felt syllable more frequently when acoustic stimuli are at the /ga/-end of the continuum than when they are at the /ba/-end. That is, as the haptic syllable pulls the response /ba/-ward, the acoustic syllable pulls it /ga/-ward, yielding a blend response. Similarly, when GA is the mouthed syllable, "va" identifications should be more frequent when acoustic syllables are at the /ba/- than at the /ga/-end of the continuum. For purposes of the analysis, we eliminated middle-continuum syllables 5 and 6 and examined identifications of acoustic and haptic syllables when acoustic syllables were /ba/-like (continuum numbers 1-4) or /ga/-like (7-10). Table 2 shows the results for the single subject who gave a considerable number of "va" responses.

Table 2. One subject's frequency of "va" responses to acoustic (left) and haptic (right) syllables on bimodal trials, when the synthetic syllables were from the ba- or the ga-ends of the continuum. Middle continuum numbers (5 and 6) were disregarded.

    [Table 2 layout: rows, haptic BA and haptic GA; columns, judged heard syllable (ba-end, ga-end) and judged felt syllable (ba-end, ga-end). The individual cell frequencies are not legible in this copy.]

His judgments of both acoustic and haptic syllables show the predicted pattern. That is, when BA is the mouthed syllable, "va" identifications of both the acoustic and the haptic syllables are more likely when the acoustic syllable is from the /ga/- than from the /ba/-end of the continuum. The pattern is reversed when the mouthed syllable is GA. Neither table has sufficiently high frequencies for a Chi Square analysis; however, a pooled table does. In an analysis, then, of the likelihood of a "va" identification of either a haptic or acoustic syllable when haptic and acoustic syllables pull in the same or opposite directions, the Chi Square value is highly significant (χ² = 8.54, p = .0036). Based on this analysis, we can conclude that there exists at least one person for whom the influence of haptic on acoustic judgments and vice versa is a true integration and not a selection. Possibly, two other of the seven subjects show evidence for integration in the form of superadditivity of cross-modal influences as well.
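The pooled analysis is an ordinary chi-square test of independence. The sketch below shows the form of such a test in Python; the 2 x 2 counts ("va" versus other responses when the two modalities pull in opposite versus the same direction) are invented for illustration, since the pooled table itself is not reproduced here.

    from scipy.stats import chi2_contingency

    # Hypothetical pooled counts of "va" vs. other responses, split by whether
    # the haptic and acoustic syllables pulled the response in opposite
    # directions (blend-friendly) or the same direction.
    #                      "va"   other
    pooled = [[18, 122],  # opposite directions
              [ 5, 135]]  # same direction
    chi2, p, dof, expected = chi2_contingency(pooled)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")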

GENERAL DISCUSSION

Our findings in the Tadoma condition of Experiment 1 and in Experiment 3 suggest that perception need not be based on covert anticipations of the range of sensory cues that may be experienced from stimulation, nor on associations between those cues and representations of their environmental causes. Together with results of the orthographic conditions of our experiments, the findings suggest, indeed, that stored associations are not sufficient for cross-modal integration of information for an event and that association in the world, specifically due to (ostensible) joint causation by a common environmental event, is required.

There is another indication, in our view, that perception of speech syllables does not require prior existence in memory of a prototype or some other way of associating sensory cues to representations of perceivable events. In a McGurk procedure, when the acoustic syllable is /da/ while the model mouths /ba/, listeners frequently report having heard /bda/ (e.g., Massaro, 1987). In Massaro's fuzzy-logical model of perception (FLMP), a syllable is reported by listeners if it is represented by a prototype in memory, and if evidence consistent with that prototype is stronger than evidence consistent with other prototypes as determined by Luce's choice rule (1959). In discussing the definition of prototypes in the model, Massaro refers to a prototype for /bda/ (1987, p. 128). But how could a hearer acquire a prototype for /bda/? /bda/ is not a possible syllable in English. Indeed, it violates the language-universal sonority constraint (roughly, that consonants in a within-syllable cluster must increase in vowel-likeness toward the syllable's vowel; e.g., Selkirk, 1982). Listeners will not have experienced /bda/ prior to the experiment; nor will evolution have provided a prototype for a universally disallowed syllable.6 The only possibility, it seems, is that prototypes can be constructed "on the fly" in the experiment. That is, there must be a way for the perceiver to decide that no existing prototype is sufficiently supported by the evidence in stimulation. In that case, a new prototype is established and named. However, if the information in stimulation is sufficient to reveal that a new prototype is needed and to name the prototype /bda/, then, it seems, the prototype itself is not needed for perception. Rather, it depends on perception for its establishment.
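For readers unfamiliar with the model under discussion: in the FLMP, each information source assigns a fuzzy truth value to each prototype, the values for a prototype are multiplied, and the response probability is that product's relative goodness under Luce's choice rule. The sketch below is our own paraphrase of that published description, not Massaro's code; it also makes concrete why a response category such as /bda/ must already exist as a prototype before it can be selected.

    def flmp_response_probabilities(auditory: dict, visual: dict) -> dict:
        # Fuzzy truth values per prototype are combined multiplicatively,
        # then normalized over all prototypes (Luce's choice rule).
        prototypes = auditory.keys() & visual.keys()
        support = {p: auditory[p] * visual[p] for p in prototypes}
        total = sum(support.values())
        return {p: s / total for p, s in support.items()}

    # Acoustic evidence favoring /da/, optical evidence favoring /ba/.
    # Only categories present in the prototype inventory can receive any
    # probability, which is the point at issue for a response such as /bda/.
    print(flmp_response_probabilities(auditory={"ba": 0.2, "da": 0.8},
                                      visual={"ba": 0.9, "da": 0.1}))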

We also doubt that our outcome is compatible with the motor theory. Liberman and Mattingly (1985) invoke an innate vocal-tract synthesizer that subserves both speech production and, using something like analysis-by-synthesis, speech perception as well. There is reason to suppose that an innate synthesizer would have anticipated the possibility of receiving optical as well as acoustic information about vocal-tract gestures, since both of those information sources are commonly available in the environment of listeners; ability to exploit the information sources might have adaptive significance. However, there is no reason to suppose that selection would have favored evolution of a synthesizer that anticipated receiving haptic information provided by the hands of a listener. More generally, we doubt that any explanation for the Tadoma effect can work that depends on the significance of the haptic information being appreciated either because the significance has been learned or because it is known innately.

How can a new environmental occurrence be perceived, or likewise in our experimental situation, how can a familiar occurrence be signaled effectively, in part, by novel proximal stimuli? We follow Shaw, Turvey, and Mace (1982) in concluding that perceptual experience is fundamentally knowledge acquired due to the "force of existence" of events in the world. In our terms (and not necessarily those of Shaw et al.), environmental events causally structure media such as light, air and the skin and joints of a perceiver. To the extent that the structure is distinctive to its causal source, it can serve as information about the source. The information can inform without prior familiarity with it because of the causal chain that supports perception. Stimulation caused by an environmental event has causal effects on sensory receptors so that its structure is, in part, transmitted to a perceptual system. By hypothesis, the perceiver comes to know an event in the environment via its impact on the perceptual systems as transmitted by proximal stimulation.

In that sense, perception may not be a particularly intellectual endeavor at all; it may be more analogous to motor accommodations to, or motor exploitations of, physical forces exerted on the bodies of actors. In the analogy, proximal stimuli at the sense organs are the informational analogs of the physical forces impinging on the body of an actor. We need not learn what haptic consequences of environmental events mean in order to perceive their source any more than we have to learn what gravity means in order to be affected by it in a coherent way. The role of learning, then, may be to change the perceiver/actor's state of preparedness for receiving and exploiting the "forces," both physical and informational, that the world has to offer, not to discover what environmental events the forces signal. That is given in the stimulation. Even in the absence of relevant learning by the perceiver/actor, the environment exerts its same forces that causally affect the body, including the perceptual systems.

REFERENCES

Campbell, R. (1989). Seeing speech is special. Behavioral and Brain Sciences, 12, 758-759.
Chomsky, C. (1986). Analytic study of the Tadoma method: Language abilities of three deaf-blind subjects. Journal of Speech and Hearing Research, 29, 332-347.
Easton, R., & Basala, M. (1982). Perceptual dominance during lipreading. Perception and Psychophysics, 32, 562-570.
Erber, N. (1969). Interaction of audition and vision in the recognition of oral speech stimuli. Journal of Speech and Hearing Research, 12, 423-425.
Ewertsen, H., & Nielsen, H. B. (1971). A comparative analysis of the audiovisual, auditive and visual perception of speech. Acta Otolaryngologica, 72, 201-205.
Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3-28.
Gibson, J. J. (1966). The senses considered as perceptual systems. Boston: Houghton Mifflin.
Gibson, J. J. (1979). The ecological approach to visual perception. Boston: Houghton Mifflin.
Keller, H. (1903). The story of my life. New York: Doubleday.
Liberman, A. (1982). On finding that speech is special. American Psychologist, 37, 148-167.
Liberman, A., Cooper, F., Shankweiler, D., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
Liberman, A., & Mattingly, I. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.
Luce, D. (1959). Individual choice behavior. New York: Wiley.
MacDonald, J., & McGurk, H. (1978). Visual influences on speech perception processes. Perception and Psychophysics, 24, 253-257.
Massaro, D. (1987). Speech perception by ear and eye: A paradigm for psychological inquiry. Hillsdale, NJ: Lawrence Erlbaum Associates.
Massaro, D. (1989). Multiple book review of Speech perception by ear and eye: A paradigm for psychological inquiry. Behavioral and Brain Sciences, 12, 741-755.
Massaro, D., Cohen, M., & Thompson, L. (1988). Visible language in speech perception: Lipreading and reading. Visible Language, 22, 8-31.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.
Rosenblum, L. (1987). Towards an ecological alternative to the motor theory of speech perception. Perceiving-Acting Workshop Review, 2, 25-28.
Schultz, M., Norton, S., Conway-Fithian, S., & Reed, C. (1984). A survey of the use of the Tadoma method in the United States and Canada. Volta Review, 86, 282-292.
Seidenberg, M., & Tanenhaus, M. (1979). Orthographic effects on rhyme monitoring. Journal of Experimental Psychology: Human Learning and Memory, 5, 546-554.
Selkirk, E. (1982). The syllable. In H. van der Hulst & N. Smith (Eds.), The structure of phonological representations, Vol. 2 (pp. 337-384). Dordrecht, The Netherlands: Foris Publications.
Shaw, R., Turvey, M. T., & Mace, W. (1982). Ecological psychology: The consequence of a commitment to realism. In W. Weimer & D. Palermo (Eds.), Cognition and the symbolic processes, Vol. 2 (pp. 159-226). Hillsdale, NJ: Lawrence Erlbaum Associates.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212-215.
Summerfield, A. Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 3-51). London: Lawrence Erlbaum Associates.
Summerfield, A. Q., & McGrath, M. (1984). Detection and resolution of audio-visual incompatibility in the perception of vowels. Quarterly Journal of Experimental Psychology, 36A, 51-74.
Tanenhaus, M., Flanigan, H., & Seidenberg, M. (1980). Orthographic and phonological activation in auditory and visual word recognition. Memory & Cognition, 8, 513-520.

FOOTNOTES

*Journal of Experimental Psychology: Human Perception and Performance, 17, 816-828 (1991).
†Also Dartmouth College.
††Dartmouth College.
1. As we discuss later, we attempted to bias the experiment in favor of the orthographic condition by presenting the printed syllable for a long period of time. Possibly, however, this was a design flaw in that the printed syllables failed to demand as much attention as the haptically-perceived syllables. We provided a more attention-demanding orthographic condition in Experiment 2.

2. In an analysis including all 22 subjects, the main effects of Continuum number and of Cross-modal syllable remain highly significant, while the critical interaction is marginally significant (F(1,21) = 4.16, p = .052). The difference in heard "b" responses as a function of feeling mouthed "ba" versus "ga" is 12.6%; as a function of seeing printed "ba" versus "ga" it is 5.6%. We performed this test including all 22 subjects for two reasons. One was a discomfort with eliminating nearly half our sample of subjects in the main analyses. A second was a sense that, because subjects made the "felt" judgments second, some forgetting might have occurred. If so, their performance on the felt judgments may have underestimated both their true ability to identify felt "ba" and "ga" and the influence that the felt syllable might have had on the heard syllable.

3. The stimulus onset asynchrony (SOA) was determined by pilot testing subjects until we found an SOA at which performance identifying the spelled syllables was approximately that of the haptically-identified syllables of Experiment 1. The SOA at which performance approached that on the Tadoma condition of Experiment 1 was 67 ms. Occasional trials may have had SOAs of 83 ms, however, because we were unable to control target and mask presentations relative to the occurrence of screen refreshes.

"In Experiment 1, only supraglottal actions of the model mim­iclced those of naturally-spoken lbal and Iga/; there was no

Page 18: Listening with Eye and Hand: Cross-modal Contributions to Speech

80 Fowler and Dekle

channeling of air through the oral cavity and, accordingly,there was no whispering. As for biasing effects on the model ofhearing the synthetic syllables, two factors precluded that.First, to make the movements, particularly of /ba/,synchronous with the synthetic syllable-a condition wepresumed important to integration (cf. Easton Ie Basala,1982)-closing movements for the consonant had to be initiatedbefort the oRiel of the signal so that articulatory release, wherethe onset of acoustic energy begins for stop consonants, co­occurred with the stop burst and vowel onset. Second,however, were the model to delay mouthing the syllable untilthe identity of the spoken syllable was detectable, no otherbehavior than following instructions to mouth the syllableprinted on the CRT screen could have led subjects to shift their

judgments of the heard syllable in the direction of the mouthedsyllable.

5. We thank George Wolford for pointing this out.
6. While in rapid speech production certain "be-" words (such as "beneath" or "become") may be reduced so that the first vowel is inaudible, there are no entries in the dictionary whereby such reduction would lead to a /bda/ syllable. (Readers of the manuscript have suggested the words "bidet," "bedazzle," "bedecked," and "bedevil"; however, they should recall that Massaro's memory representations are syllable prototypes, not consonant prototypes. Were the schwa vowels dropped in the foregoing words, the remaining syllables would be /bdey/, /bdæz/, /bdɛkt/ and /bdɛv/, not /bda/ (/a/ being the first vowel in "father").)