Journal of English for Academic Purposes 1 (2002) 105–119
www.elsevier.com/locate/jeap

The use of interactive input in EAP listening assessment

John Read*

School of Linguistics and Applied Language Studies, Victoria University of Wellington, PO Box 600,

Wellington, New Zealand

* Tel.: +64-4-463-5606; fax: +64-4-463-5604. E-mail address: [email protected] (J. Read).

Abstract

The assessment of listening comprehension for academic purposes is an area which has not received much attention from researchers. This article focuses on the form of the input for EAP listening tests. While there is a great deal of interest currently in the use of visual media for listening assessment, it is likely that tests with purely auditory input will continue to have a significant role. The article reports on the development of a test in two audiotaped versions: a scripted monologue and an unscripted discussion of the same topic by three speakers. The test was administered to two matched groups of learners taking an intensive pre-sessional EAP course. In contrast to the results of an earlier study by Shohamy and Inbar [Language Testing 8 (1991) 23], it was found that the monologue version was significantly less difficult than the discussion. Various possible reasons for the difference in findings are presented and the article concludes with a consideration of what can be learned from the research for the design of listening test tasks with interactive input.
© 2002 Elsevier Science Ltd. All rights reserved.

1. Introduction

Listening is an important skill for learners of English in an academic study context, since so much of what they need to understand and learn is communicated through the oral medium. Although this skill often functions as an integral part of two-way oral interaction, students routinely find themselves in what Buck (2001: 98) calls "non-collaborative" listening situations, in which they listen to a lecture, seminar presentation or tutorial discussion without having the opportunity to participate orally themselves. Thus, non-collaborative listening tasks are an established component of EAP proficiency tests and are likely to remain so.


In the design of such tests, a common practice is to take the university lecture as the prototypical target language use situation (in terms of Bachman & Palmer's (1996) framework for test design). This means that the stimulus material has the form of a talk or mini-lecture on a general academic topic and the learners are required to demonstrate their comprehension of the content by responding to a set of test items.

However, if we set out to design such a test from scratch, we are confronted with numerous questions:

• Should the test include other target language use situations besides the lecture, such as the seminar discussion or individual consultation with a lecturer?
• Should the stimulus material be taken from a genuine lecture, and if not, should it be scripted or not?
• How should differences among the learners in background knowledge of the topic(s) be addressed?
• Should the test input be delivered live, on audiotape, or through visual media like videotape and multimedia packages?
• Should the stimulus material be heard only once, or should it be repeated?
• Should the test-takers be allowed, or expected, to take notes while they are listening?
• To what extent should the response task be a "pure" measure of listening comprehension, involving only a minimum amount of reading or writing?

Undoubtedly the questions are answered in various ways in tests used by EAP programmes around the world. In addition, Buck's (2001) book Assessing Listening gives a comprehensive overview of the issues and a wealth of practical advice on test design, drawing on relevant research where it is available. However, the published research studies on listening assessment are limited in number, and those involving EAP tests are even fewer. This means there is plenty of scope for further research on what Bachman and Palmer (1996) call the "task characteristics" of listening comprehension tests.

Some researchers have looked at choice of topic, and more specifically the role of background knowledge of the topic on listening test performance, since its effect on reading comprehension is well documented. Chiang and Dunkel (1992) compared their Taiwanese students' understanding of a talk in English on a culturally familiar topic (Confucianism) with an unfamiliar one (the Amish people in the USA) and found a significant effect, although it manifested itself only in test questions which could be answered without reference to the passage. A more clear-cut result was obtained by Schmidt-Rinehart (1994), who found that her American students of Spanish understood a topic already covered in their textbook much better than one which they had not studied in Spanish before. In an EAP setting in the USA, Jensen and Hansen (1995) produced inconsistent results indicating that students who had previously studied the topic of a lecture could comprehend it better in some cases, particularly if the topic was more technical in nature. Thus, the evidence for a significant background knowledge factor is mixed. As Jensen and Hansen point out, one of the difficulties in obtaining more conclusive results is to find an adequate operational measure of how much the test-takers know about the listening test topic.


Research has also focused on the type of test items used and how they relate to the input text. Shohamy and Inbar (1991) found that "local" questions on factual details and the meanings of lexical items were easier to answer than "global" questions on main ideas and inferences, while "trivial" questions on incidental numbers, dates and names yielded inconsistent responses. Two TOEFL research studies have analysed how various features of the short dialogues (Nissan, DeVincenzi, & Tang, 1996) and the minitalks (Freedle & Kostin, 1999) used as stimuli in the TOEFL listening section influence the difficulty of the multiple-choice items based on them. Freedle and Kostin concluded that, while characteristics of the test items themselves have some effect, it is text and text/item overlap variables which are the most significant predictors of item difficulty.

Another aspect of listening test design that has received some attention is the issue of when to present the test items. Both Groot (1975) and Sherman (1997) found that test scores were essentially the same whether the questions were given before or after the learners listened to the stimulus material. However, Sherman's experiment also showed that significantly higher scores were achieved with a "sandwich" format, whereby the test-takers listened to the input text once, were given the questions and then heard the text a second time. It was also the format which created the least anxiety and was strongly preferred by the test-takers over both the "questions-before" and "questions-after" conditions.

One area which is now being addressed is the form of the input. The standard method of presenting the stimulus material for a listening test, especially a large-scale and high-stakes one, is by means of a pre-recorded audiotape. However, given the routine use of videos and other visual material in the contemporary language classroom, there has been increasing interest in providing various forms of visual input in listening assessment. Already the listening section of the computer-based TOEFL test incorporates photos to accompany each of the dialogues and mini-talks (Ginther, 2001). In addition, several writers, notably Gruba (1997), have advocated the use of visual media in language assessment and there has been research to investigate the use of video input for listening comprehension tests (Coniam, 2001; Gruba, 1993; Progosh, 1996).

The results of the research have been rather variable. Progosh (1996) elicited very positive attitudes to video-based listening tests among the Japanese learners he surveyed. In her TOEFL study, Ginther (2001) found that test-takers generally preferred listening input with accompanying visuals, and the pictures had some positive effect on comprehension when they provided content information for mini-talks or assisted the test-takers to identify the various speakers in an academic discussion. On the other hand, Gruba's (1993) results showed that a video version of a simulated lecture had no measurable effect on students' test performance, as compared to a purely audio version. In a similar kind of comparative study involving English teachers in Hong Kong, Coniam (2001) found that if anything the group listening to an audio version of an educational discussion understood it better than those who took the video version. In addition, over 80% of the video group considered that the video had not helped their comprehension at all, over a third of them (almost) never looked at the screen during the test, and the majority expressed a preference for audio as a listening test medium. Such reports are not uncommon in more informal accounts of trials of visual input for listening tests in various parts of the world.


Clearly, further investigation and conceptual development are required to determine how best to employ visual material for listening assessment, in keeping with the construct definition of listening comprehension that is appropriate for particular testing programmes. In the meantime, for practical as much as for more principled reasons, listening tests with only auditory input will continue to have a prominent role for the foreseeable future. This means that there is still scope for research on various characteristics of listening test input based on audiotaped texts.

One substantial contribution to such research was the study by Shohamy and Inbar (1991), who focused on the degree to which listening test stimuli have the features of natural speech, as distinct from formal written language. They developed three versions of a listening test in English for Israeli high school students, each with a different text type: a scripted monologue simulating a news broadcast, a lecturette in which the speaker interacted with an addressee, and a consultative dialogue between an expert and an addressee. The versions were designed to represent distinct points along a literate to oral continuum. According to the results, the monologue was significantly more difficult to understand than the two more "oral" text types. Shohamy and Inbar pointed out that the three texts were distinguished by numerous discoursal and pragmatic features, such as the density of the propositions, the amount of repetition and redundancy, and the degree of grammatical complexity. Although the authors saw the three types of input as being ranged along a continuum, the essential difference appears to be the interaction between speaker and audience in the lecturette and dialogue, as compared to the one-way communication of the news broadcast. Shohamy and Inbar's conclusion was that a variety of text types should be used in a listening comprehension test, depending on the purpose of the test and the learners' listening needs.

The present study can be seen as exploring the implications of Shohamy and Inbar's findings in an EAP context. It was undertaken as a joint research project between the applied linguistics programmes of the University of Melbourne, Australia and Victoria University of Wellington, New Zealand. Both institutions had existing academic listening tests as part of their institutional EAP proficiency test batteries. In Melbourne the whole test was used primarily to assess the English proficiency of international students on arrival at the university. The listening test was based on an audiotaped talk, with short-answer questions. In Wellington, on the other hand, the proficiency test was administered towards the end of a 3-month intensive EAP course as a basis for reporting on the students' proficiency to the university admissions office, the sponsoring agency (where applicable) and the students themselves. In this case, the listening test incorporated two scripted talks that were presented live to all the students assembled in a large lecture theatre.

At both institutions there was a desire to explore alternative formats for listening assessment. Initially the intention was to develop a new form of the Melbourne test using videotaped input and compare it with an audio version based on the same content. However, two main considerations caused us to change tack. One was the force of Gruba's (1997) argument against the meaningfulness of comparisons between tests using different input media. His position is that each medium needs to be evaluated separately, in its own terms. The other was a more practical concern about the cost of good-quality video production, especially since parallel forms of the video-based test would be required on an ongoing basis if it were to be incorporated into the operational test battery. Instead, the decision was made to compare two forms of audiotaped input: a scripted monologue and an unscripted discussion of the same content by three speakers.


Thus the aims of the research were first to investigate the feasibility of developing a suitable audiotaped discussion and then to compare it as input material for an EAP listening test with a conventional scripted monologue on the same topic.

2. The study

2.1. Development of the test material

The initial step in designing the test was to identify a topic that was appropriate for academic study purposes and would lend itself to discussion from more than one point of view. The topic chosen was ethical issues in medical research. To provide the basis for the test material, we used two case studies involving ethically dubious practices which had received prominent media attention in our two countries. The first one was the subject of a series of investigative articles in a Melbourne newspaper, which reported how in the late 1940s and early 1950s medical researchers developing vaccines for childhood diseases had conducted trials in babies' homes and orphanages. The trials had been largely unsuccessful in preventing infection and the issue was whether the researchers were justified in carrying out tests of unproven medicine with young children separated from their families. The second case was one that had created considerable controversy in New Zealand in the mid-1980s, when it was revealed that a professor of gynaecology at an Auckland hospital had sought to challenge the then-conventional wisdom that a condition called carcinoma in situ (CIS) was a precursor of cervical cancer, by withholding from numerous patients the normal treatment for CIS (namely, a hysterectomy) in order to see whether cancer would develop. The women were not informed either that they had CIS or that they were participants in an experiment.

From a present-day perspective, the two cases raised similar issues: the ethics of experimentation on patients using unproven or controversial treatments; the need to protect the rights and interests of vulnerable members of society; and the necessity of obtaining from patients or their guardians informed consent to the treatment proposed by the doctors.

A script was written for the monologue version of the test in four sections. The first part presented the case of the Melbourne vaccine trials, followed by a section which discussed the ethical issues arising from it. The third and fourth parts did the same for the cervical cancer case in Auckland. The preparation of the script enabled us to clarify what factual information needed to be presented and the ethical issues that were involved, as well as to identify the unavoidable technical terms which were required for the discussion of the topic. The script was written in the relatively formal style of a broadcast talk on public radio, without inserting features of informal speech, such as hesitations, false starts or fillers such as "you know", "you see" or "Now, ...". In this respect, it was similar to the stimulus material for the existing Melbourne listening test.


Next the discussion version of the test was developed. An initial attempt to make a recording involved a free-wheeling discussion of the two medical cases by three speakers. This proved to be unworkable because the conversation lacked a clear discourse structure, assumed too much shared knowledge and did not lend itself to the writing of suitable test questions. Thus, a second, more carefully planned version was prepared. It was designed to match the four-part structure of the monologue, while at the same time simulating a university tutorial rather than a scripted lecture. The three speakers had read the source articles on the two cases, as well as the monologue script, and they planned in advance how the discussion would proceed, but it was not scripted at all. In order to make it easier for listeners to distinguish the speakers, each took a distinct role. One, a middle-aged woman, took the tutor's role, introducing each section of the discussion, allocating turns, clarifying certain points and summing up as necessary. The other two younger speakers, a female and a male, acted as students, who each summarised the facts of one of the cases and then debated the ethical issues. Apart from their gender, the students were distinguishable by the fact that the female took a more critical stance towards the medical researchers, whereas the male tended to put the case for the doctors and to defend their actions in relation to the prevailing ethical standards at the time the research was conducted.

In addition to the spoken input, a written text of about 500 words was prepared. Its purpose was to orient the test-takers to the subject matter before they listened to the spoken test material. The text gave general information about the topic, with some background on each of the two cases, but without revealing any of the specific information needed to respond to the test items. Some key terms were highlighted in bold font and defined in the text, e.g. ethics, vaccine, trials, uterus, cervix/cervical cancer and hysterectomy. Time constraints on the administration of the test meant that the test-takers were allowed just 3 min to read the text, although it was also available to refer to while they were taking the test itself.

Both versions of the input material were recorded on audiotape, with standard instructions and pauses included. There was a pause of 2 min before each of the four sections of the test to allow time to read the test items based on that section. The test-takers needed to write answers to the test items while they were listening to the input, although there was also a 1-min pause at the end of each section (and 2 min after the last section) for them to complete and check their responses. While it can be argued that listening to input and writing answers in this concurrent fashion is cognitively quite demanding, it simulated the target language use situation in that university students commonly need to take notes as they listen to lectures or seminar discussions. The total running time for the monologue tape was 32 min and for the discussion tape it was 37 min. Thus, the interactive nature of the discussion meant that it took several minutes longer to cover the test content, as compared to the scripted talk.


A common set of test items was developed for the two versions of the test. They were originally drafted on the basis of the monologue script, for the practical reason that the discussion version had not been transcribed at that point. They were subsequently checked against the transcript, to ensure that all of the information needed to respond to the items was included in the discussion and that the order of the items followed the sequence of information in both versions. The items were of the short-answer type, requiring responses ranging from a single word or number to a sentence. In its final form, the test contained 36 items. They were presented to the test-takers in a booklet in which they wrote their responses. The booklet also contained the introductory written text.

In addition to the test material, a questionnaire was written to be administered to the participants after they had completed the test. Part 1 of the instrument contained six items to obtain background information from the learners. This included their first language, length of residence in the country, plans after the course, and self-rating of their listening skills. Part 2 elicited the learners' opinions about the test by means of eight three- or four-option closed-response items. The items focussed on the comprehensibility of the speech, the suitability of the topic, the presentation of the test questions and the overall difficulty of the test. At the end of the questionnaire, the students were invited to write their own comments about the test. As it turned out, very few of them did so, and thus this final question yielded no usable data.

2.2. Participants

The original plan was to conduct the study in both Melbourne and Wellington. However, only a small number of learners were available to be tested in Melbourne during the period when the data-gathering needed to be carried out. Thus, most of the participants in the study were adult non-native speakers of English taking a 3-month intensive English course at Victoria University of Wellington. They came from a range of national and language backgrounds, as is typical for courses of this kind in English-speaking countries. Their predominant geographical origin was East and Southeast Asia, accounting for three-quarters of the total. According to the questionnaire responses, the largest language groups were speakers of Chinese (31), Japanese (11), Korean (9), Vietnamese (9) and Thai (5). Other regions of the world represented were Europe (10), South Asia (5), East Africa (2) and the South Pacific (2). The learners were a mix of relatively recent immigrants and international students who had entered the country on a student visa. Twenty-seven of them had been in the country only since the beginning of the course and a further 34 had been resident for up to 1 year. The others had lived in the country for varying periods, but only four had been there for more than 5 years.

Most of the learners were studying English for academic or occupational purposes. About two-thirds of them reported that, after the end of the course, they would undertake studies at the undergraduate or postgraduate level. Of the others, 10 planned to return to their own country, nine intended to work in New Zealand and seven were going to take another English course.


The New Zealand participants were in six classes, constituted partly on the basis of proficiency level, as determined by a placement test at the beginning of the course, but also according to whether they planned to study at the undergraduate or postgraduate level. They were in the same class for most of the 23 h of instruction per week, following an integrated-skills course based on a series of weekly study themes. Each week all the classes attended a talk by a guest lecturer and they also had two sessions in an audio-visual classroom equipped with a language laboratory system. The present study was conducted in the eighth and ninth weeks of the course.

An additional eight students were individually recruited in Melbourne and were tested following the same procedure as in Wellington.

2.3. Procedure

The testing took place in Wellington during two of each class's weekly sessions in the audio-visual classroom. In the first week of the study, the students took one version of the existing Melbourne listening test on the topic of dreams. This functioned as a pre-test, to provide a common measure of the participants' listening proficiency and to give data necessary for the formation of matched pairs. The students were paired primarily on the basis of their pre-test scores but, especially when an odd number of them obtained a particular score, class membership, gender and national origin were also taken into account. One member of each pair was randomly assigned to Group A and the other went to Group B.
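
As an illustration of this pairing logic, the short Python sketch below shows one way that score-based pairing with random assignment could be carried out. It is not the procedure actually used in the study: the student records, the simple tie-breaking by class, and the fixed random seed are all hypothetical, introduced only to make the example self-contained.

```python
# Illustrative sketch only: pair students by pre-test score and split each pair
# randomly into two groups. The data and tie-breaking rule are hypothetical;
# the study also weighed class membership, gender and national origin.
import random

# Hypothetical pre-test records: (student_id, pretest_score, class_label)
students = [("S01", 21, "C1"), ("S02", 21, "C2"), ("S03", 18, "C1"),
            ("S04", 18, "C3"), ("S05", 15, "C2"), ("S06", 14, "C3")]

# Sort by score (class as a secondary key) so adjacent students are comparable.
students.sort(key=lambda s: (-s[1], s[2]))

group_a, group_b = [], []
random.seed(42)  # fixed seed so the sketch is reproducible

# Walk through the sorted list two at a time and randomly split each pair.
for first, second in zip(students[0::2], students[1::2]):
    pair = [first, second]
    random.shuffle(pair)
    group_a.append(pair[0][0])
    group_b.append(pair[1][0])

print("Group A (monologue):", group_a)
print("Group B (discussion):", group_b)
```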

The following week, the six classes were scheduled in twos for their sessions in the two adjoining audio-visual classrooms. At each session, students from both classes who had been assigned to Group A went to one classroom and those assigned to Group B went to the other. It had been randomly determined that Group A would hear Version 1 (monologue) and Group B Version 2 (discussion). Nine students who took the pre-test did not attend the experimental test session in Week 2. Conversely, there were 11 students who missed the pre-test but took the experimental one. The latter participants were paired with another absentee from their own class as much as possible and randomly assigned to Group A or B. In one class, all the students took both tests; otherwise the absentees were fairly evenly spread across the five remaining classes. In total, then, 96 participants took the experimental test: 47 in Group A and 49 in Group B.

First, the test-takers received the answer booklet containing the 500-word written background text as well as the listening test items. They were given three minutes to read the text and then the test itself began. As in the pre-test, the test tape was played from the language laboratory console and the participants listened through headsets at their individual booths. When they had completed the test, the answer booklets were collected and the questionnaire was handed out. The participants filled in the questionnaire right away and returned it to the test supervisor when they had finished.


3. Results and discussion

3.1. The whole test

The test proved to be quite difficult overall, with a mean score for all 96 test-takers of just 16.0 out of 36 and a standard deviation of 8.0. Given that some subjective judgement was involved in scoring the short-answer questions, the reliability of the whole test was quite satisfactory, at 0.89 (based on a Rasch calculation of the reliability of the person ability estimates).
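
The figure reported above is the Rasch person-reliability index rather than a classical coefficient, although Table 2 later describes it as the analogue of KR-20. Purely as a point of reference, the sketch below shows how KR-20 itself would be computed from a dichotomously scored item-response matrix; the tiny matrix is invented for illustration and has no connection to the study's data.

```python
# Hedged illustration: classical KR-20 on a made-up 0/1 response matrix.
# The article reports a Rasch person-reliability estimate instead; this only
# shows the classical internal-consistency idea the Table 2 footnote refers to.
import numpy as np

# Rows = test-takers, columns = items (purely hypothetical scores).
responses = np.array([
    [1, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1],
])

k = responses.shape[1]                         # number of items
p = responses.mean(axis=0)                     # proportion correct per item
q = 1 - p
total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR-20 = {kr20:.2f}")
```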

In Part B of the post-test questionnaire, there were some items that applied to the test as a whole, and the two groups of test-takers had very similar response patterns to them. Thus, it is useful to review these items before proceeding to consider the comparison of the two versions of the test. The combined responses from the two groups are presented in Table 1.

With regard to the topic of medical ethics, two-thirds of the test-takers reported it was very familiar to them, which was a little surprising, and it is unlikely that more than a few of them had prior knowledge of the two particular cases which were discussed. Their level of interest in the topic was somewhat lower than its familiarity, but still most of them rated it as at least "interesting".

The other three questionnaire items in Table 1 were concerned with the way that the test questions were presented in the test booklet. The majority of the students indicated that they did not have a problem understanding how they were to respond to the questions. On the other hand, between a quarter and a third of them considered the layout hard to follow, the test questions difficult to understand and/or the reading time insufficient.

Table 1
Reactions to the test by the test-takers as a whole

The topic (medical ethics) was:
  Very familiar: 67    Familiar: 22    Not familiar: 2    No response: 5    Total: 96
  Very interesting: 30    Interesting: 57    Not interesting: 7    No response: 2    Total: 96

The layout of the test booklet was:
  Very clear: 13    Clear: 56    Hard to follow: 27    Total: 96

Understanding the test questions was:
  Easy: 18    Quite easy: 48    Difficult: 30    Total: 96

The time allowed for reading the test questions was:
  Too short: 26    Just right: 67    Too long: 3    Total: 96


A review of the individual responses shows that these were not the same students in each case; only five of them gave this exact response pattern. Nevertheless, the fact that a substantial number expressed these concerns, including students who obtained good scores in the test, means that these aspects should be further investigated before any operational administration of the test.

3.2. The comparison of the versions

The key question in this study was whether the two groups performed differently according to which version of the test they took. The relevant statistics are set out in Table 2. When we compare the mean scores, we find that Group A clearly had the higher mean (18.0 vs. 14.16) and the t-test confirmed that it was a significant difference at the 0.05 level. Thus, contrary to our expectation based on the Shohamy and Inbar (1991) study, the monologue version of the test turned out to be less difficult than the discussion one.

There are a number of possible reasons for the higher mean score achieved by the group who took the monologue version of the test. One is simply a practice effect. Group A had taken a very similar test the previous week, one which also involved listening to a scripted talk. On the other hand, Group B were presented with a new form of input, which they may not have previously encountered in a listening test. To control for this potential effect, the pre-test should ideally have been in a different format from either of the tests used in the experiment.

The second advantage that the monologue group may have had concerns the relationship between the two forms of stimulus material and the test items. As previously explained, the items were originally written on the basis of the monologue script, then later checked against the transcript and modified where necessary to ensure that all of the information needed to answer the test items was presented in both texts. Despite this precaution, the items are still likely to have had a closer relationship to the monologue script than to the discussion. For instance, a subsequent review of the test material shows that for at least one pair of items the information required was given in the reverse order in the discussion, whereas it followed the same sequence as the items in the monologue. In addition, a couple of the students who took the discussion version commented to the test administrator afterwards that they had understood what was said reasonably well but had found it difficult to compose their responses to the questions requiring full-sentence answers. This suggested that a scripted monologue may present ideas in a form that makes it easier to supply answers than a discussion does.

Table 2
Statistical analysis of the two versions of the experimental test

                      Group A (monologue)    Group B (discussion)
No. of test-takers    47                     49
No. of items          36                     36
Mean                  18.0                   14.16
Standard deviation    8.32                   7.40
Reliability (a)       0.90                   0.88
t-test                t = 2.38 (P < 0.05; df = 94)

(a) The Rasch analogue of KR-20, an internal consistency measure.
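
Because Table 2 reports only summary statistics, the group comparison can be checked directly from the means, standard deviations and group sizes. The short Python sketch below does this with SciPy; it is offered as a verification aid, not as the analysis the authors ran, and small rounding differences from the reported t = 2.38 are to be expected.

```python
# Re-computing the independent-samples t-test from the summary statistics in
# Table 2 (means, SDs, ns). Slight rounding differences from the published
# t = 2.38 are expected because the table values are themselves rounded.
from scipy import stats

t, p = stats.ttest_ind_from_stats(
    mean1=18.0, std1=8.32, nobs1=47,    # Group A (monologue)
    mean2=14.16, std2=7.40, nobs2=49,   # Group B (discussion)
    equal_var=True,                     # pooled-variance t-test, df = 94
)
print(f"t = {t:.2f}, two-tailed p = {p:.3f}")  # roughly t = 2.39, p < 0.05
```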


The third source of evidence for the differences in group performance is the post-test questionnaire, particularly in items where the two groups gave different patterns of response. The results are presented in Table 3. The first three items in this category, which related to perceptions of the speech the test-takers heard on the tape, all produced statistically significant differences. With regard to the speed of the speech, half of the group who listened to the discussion rated the speakers as "too fast", whereas the monologue group mostly found the speech was at a good speed for them. This result is understandable, given that the discussion group had to contend with three speakers, who often had a succession of short turns. However, the test-takers' perceptions of the voices are more difficult to interpret. The group who heard the monologue version rated the clarity of the speaker's voice and accent lower than the discussion group did. This is somewhat surprising, since the monologue group had only one speaker to listen to, whereas the discussion group heard three. In addition, one of the speakers in the discussion was actually the presenter of the monologue.

Table 3
Responses of the two groups of test-takers to the different versions of the test

The speaker(s) spoke:
                       Too fast    At a good speed    Too slowly    Total
Group A (monologue)    12          35                 0             47
Group B (discussion)   24          24                 1             49
χ2 = 7.13 (p < 0.01, df = 1)

The voice(s) on the tape were:
                       Very clear    Quite clear    Not very clear    No response    Total
Group A                0             17             29                1              47
Group B                2             26             21                0              49
χ2 = 3.94 (p < 0.05, df = 1)

The accent of the speaker(s) was:
                       Very clear    Reasonably clear    Hard to understand    Total
Group A                0             21                  26                    47
Group B                3             30                  16                    49
χ2 = 4.45 (p < 0.05, df = 1)

The time allowed for answering the questions was:
                       Too short    Just right    Too long    Total
Group A (monologue)    32           15            0           47
Group B (discussion)   25           24            0           49
χ2 = 2.23 (n.s., df = 1)

In general, the test was:
                       Very difficult    Quite difficult    Quite easy    Very easy    Total
Group A                8                 34                 5             0            47
Group B                18                26                 5             0            49
χ2 = 4.87 (n.s., df = 2)
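
The chi-square comparisons in Table 3 can likewise be approximated from the raw counts. The sketch below runs the test on the full response table for the first item; it illustrates the kind of test reported here rather than reproducing the authors' analysis, since the reported df = 1 implies that response categories were collapsed in a way that is not described, so the statistic will not match exactly.

```python
# Chi-square test on the raw counts for the first Table 3 item ("The speaker(s)
# spoke": too fast / at a good speed / too slowly). This illustrates the kind of
# test reported in Table 3; the published value used collapsed categories
# (df = 1), so the figures here will differ somewhat from chi2 = 7.13.
from scipy.stats import chi2_contingency

counts = [
    [12, 35, 0],   # Group A (monologue)
    [24, 24, 1],   # Group B (discussion)
]

chi2, p, df, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.3f}")
```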


The next item elicited the test-takers' perceptions of the amount of time allowed for answering the questions. There was no significant difference in the group responses in this case. If anything, it was those who listened to the monologue who tended to consider that there was not enough time. Thus, this item does not provide any evidence that Group B were at a disadvantage through having to respond to the test questions by listening to the discussion rather than the scripted talk.

Finally, the last questionnaire item referred to the general level of difficulty of the test. The test-takers as a whole responded on the "difficult" side of the scale and those in Group B had a stronger tendency to rate it as being very difficult, but overall the response patterns were not significantly different. Both groups perceived it to be a challenging test.

In summary, then, the questionnaire responses have not produced consistent evidence that might help to explain the significantly higher scores obtained by the group who received the monologue version of the test.

Another approach to explaining the differences between the two kinds of input focuses on the discrepancy between our results and those of Shohamy and Inbar (1991). It is likely that our input texts were not strictly comparable with those used in the earlier study. If we take the monologue texts first, the Shohamy and Inbar "news broadcast" was deliberately designed to have the characteristics of a written text, with "literate and explicit vocabulary, complex sentences and ... no redundancies, repetitions or pauses" (1991: 28). By contrast, our monologue script was written to be somewhat more "oral" in nature, with a restricted amount of nominalization and syntactic complexity, some repetition of key vocabulary and numerous discourse markers to give explicit signals of the structure of the text. It may in fact have had more in common with the Shohamy and Inbar lecturette, apart from the fact that there was no interaction with an addressee.

In the case of the comparison between our discussion text and the consultative dialogue of the earlier study, there were undoubtedly many shared features. However, one crucial difference involves the way that the material was developed. Although this is not explicitly stated by Shohamy and Inbar (1991), their consultative dialogue was scripted in advance, as were their other two texts, and then the features of each text were carefully analysed in order to ensure that they defined a distinct genre along the literate-oral continuum (Inbar-Lourie, 1987). This was quite different from the process by which our discussion text evolved, as described above. Although our speakers prepared beforehand and the actual discussion was guided so that it would shadow the structure of the monologue, none of the speech was scripted. Consequently, our discussion was probably a more genuine sample of the kind of "spontaneous" and "colloquial" speech that Shohamy and Inbar set out to simulate in their consultative dialogue, to the extent that ours was very demanding for an audience of non-native speakers to listen to.

Thus, in various respects our input texts had different characteristics from those used in the previous study. It is on this basis, then, that we can interpret the inconsistency between our result (that the discussion was more difficult for our learners to comprehend) and Shohamy and Inbar's finding that oral texts incorporating dialogue were easier to understand.


One implication of the discrepancy is that the notion of a single literate-oral continuum is an inadequate concept to capture the complex ways in which oral texts may vary, as for instance Biber (1988) demonstrated through his multidimensional analysis of the features of spoken and written texts. It appears that the analysis undertaken by Shohamy and Inbar was able to distinguish their three forms of input as text, but it did not fully account for the rhetorical complexity of these kinds of discourse from a listener's perspective, especially when more than one speaker was involved and the speech was unscripted.

4. Conclusion

The comparison of the two forms of listening test input is a somewhat artificial exercise. Ultimately the point is not simply to demonstrate that one is superior to the other, or alternatively that the two are interchangeable. Each needs to be considered for inclusion in a test on its own terms. The conclusion that Shohamy and Inbar (1991: 37) drew from their study was that listening tests should consist of more than one form of input, in order to reflect the range of genres inherent in underlying theories of listening comprehension. In an EAP context, this means that we should question whether a scripted simulation of a lecture is an adequate way of operationalising the construct of academic listening ability. The value of a comparative study, then, is first of all to demonstrate that students perform differently according to the nature of the input and, secondly, to bring out the distinctive features of each genre.

From this perspective, the fact that our two texts had the reverse order of difficulty from that found by Shohamy and Inbar (1991) is not a matter of great concern. As previously discussed, there are some possible reasons for the different result related to the way that we conducted our study, but the key factor is likely to have been the complexity of the variables involved in any comparison of distinctive types of spoken text. While Shohamy and Inbar highlighted features of an interactive text which could make it easier to understand, there are numerous other characteristics which can create greater difficulty for listeners, particularly when it is an unscripted discussion on a topic of which they may have limited background knowledge. Thus, the issue for listening test design is how to adjust the level of difficulty of an interactive text to make it suitable for the proficiency level of the test-takers, while at the same time maintaining its authenticity as a sample of academic speech.

One of the problems in achieving this balance is the lack of research on speaking in academic contexts that might serve as a reference point for the validation of test tasks. There is some work on university lectures and the problems non-native-speaking students experience in them (see, e.g. Flowerdew, 1994), but less on other academic listening situations. A few discourse analyses of seminar discussions have now appeared (Basturkmen, 2002; Tapper, 1996). In addition, there are at least two corpora, the TOEFL Spoken and Written Academic Corpus (Biber et al., 2002) and the Michigan Corpus of Academic Spoken English (MICASE) (Simpson et al., 2000), which include recordings of speech in various campus settings. The MICASE corpus in particular is freely available on the Web and, within the limitations of its relatively small size, will allow researchers and test developers to obtain evidence of the key features of different kinds of spoken interaction in academic contexts.


However, as some authors have noted recently (see Douglas & Nissan, 2001; Buck, 2001: 154–160), it is difficult to find corpus texts or other authentic spoken material which can be used directly as input for a listening test. Therefore, creating stimulus material specifically for the test tends to be the most feasible option, and our experience with the discussion text may be useful as a model for other test development projects. If the text is not to be scripted, a certain amount of preparation is required to help ensure that the discussion will be accessible to the listeners and will have the kind of structure and content that facilitate the writing of test items, while still capturing the qualities of natural speech. As we found, the first attempt may need to be discarded. It is interesting to note in this regard that, for the Hong Kong project reported by Coniam (2001: 6), no fewer than five discussions were recorded before the moderation committee decided on the most suitable one for their listening test. With an audio recording, one potential problem for the listeners is to distinguish the participants in a discussion. Even when (as in our case) the test items do not require the test-takers to attribute information or opinions to particular speakers, it is helpful to be able to identify the individual voices. We found that gender, role, age and point of view could be used effectively to distinguish the three speakers in our discussion.

This study has addressed some of the questions about listening test design which I listed at the outset, but by no means all of them. On the issue which was the main focus of our research, the form of the input, I have shown that there are numerous variables which can affect the difficulty of an interactive text in audio form. While work on various forms of visual input will undoubtedly continue, there is also room for further work on audiotaped alternatives to the scripted talk, in order that tests may represent more fully the construct of academic listening ability.

Acknowledgements

My collaborators in this project were Kathryn Hill and Elisabeth Grove at the Language Testing Research Centre of the University of Melbourne, and I express my great appreciation for all their contributions to the design of the study, the development of the test material and the analysis of the data. I also wish to thank Alastair Ker and Susan Smith for their help in planning and implementing the project in Wellington. The work was funded by a grant from the Collaborative Research Program at the University of Melbourne, together with matching support from the School of Linguistics and Applied Language Studies, Victoria University of Wellington.


References

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Basturkmen, H. (2002). Negotiating meaning in seminar-type discussion and EAP. English for Specific Purposes, 21, 233–242.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D., Conrad, S., Reppen, R., Byrd, P., & Helt, M. (2002). Speaking and writing in the university: a multidimensional comparison. TESOL Quarterly, 36, 9–48.
Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press.
Chiang, C. S., & Dunkel, P. (1992). The effect of speech modification, prior knowledge, and listening proficiency on EFL lecture listening. TESOL Quarterly, 26, 345–374.
Coniam, D. (2001). The use of audio or video comprehension as an assessment instrument in the certification of English language teachers: a case study. System, 29, 1–14.
Douglas, D., & Nissan, S. (2001). Developing listening prototypes using a corpus of spoken academic English. Paper presented at the Language Testing Research Colloquium, St. Louis, MO.
Flowerdew, J. (Ed.). (1994). Academic listening: research perspectives. Cambridge: Cambridge University Press.
Freedle, R., & Kostin, I. (1999). Does the text matter in a multiple-choice test of comprehension? The case for the construct validity of TOEFL's minitalks. Language Testing, 16, 2–32.
Ginther, A. (2001). Effects of the presence and absence of visuals on performance on TOEFL CBT listening-comprehension stimuli. TOEFL Research Report, 66. Princeton, NJ: Educational Testing Service.
Groot, P. J. M. (1975). Testing communicative competence in listening comprehension. In R. L. Jones, & B. Spolsky (Eds.), Testing language proficiency. Arlington, VA: Center for Applied Linguistics.
Gruba, P. (1993). A comparison study of audio and video in language testing. JALT Journal, 16, 85–88.
Gruba, P. (1997). The role of video media in listening assessment. System, 25, 335–345.
Inbar-Lourie, O. (1987). The effect of text and question type on achievement in listening comprehension tests in English as a Foreign Language. Unpublished Master's thesis, Tel-Aviv University.
Jensen, C., & Hansen, C. (1995). The effect of prior knowledge on EAP listening test performance. Language Testing, 12, 99–119.
Nissan, S., DeVincenzi, F., & Tang, K. L. (1996). An analysis of factors affecting the difficulty of dialogue items in TOEFL listening comprehension. TOEFL Research Report, 51. Princeton, NJ: Educational Testing Service.
Progosh, D. (1996). Using video for listening assessment: opinions of test-takers. TESL Canada Journal, 14, 34–44.
Schmidt-Rinehart, B. C. (1994). The effects of topic familiarity on second language listening comprehension. Modern Language Journal, 78, 179–189.
Sherman, J. (1997). The effect of question preview in listening comprehension tests. Language Testing, 14, 185–213.
Shohamy, E., & Inbar, O. (1991). Validation of listening comprehension tests: the effect of text and question type. Language Testing, 8, 23–40.
Simpson, R. C., Briggs, S. L., Ovens, J., & Swales, J. M. (2000). The Michigan corpus of spoken academic English. Ann Arbor, MI: The Regents of the University of Michigan [Available online at www.lsa.umich.edu/eli/micase/micase.htm].
Tapper, J. (1996). Exchange patterns in the oral discourse of international students in college classrooms. Discourse Processes, 22, 25–55.

John Read teaches courses in applied linguistics, TESOL and academic writing at Victoria University of Wellington, New Zealand. His research interests are in the testing of English for academic purposes and second language vocabulary assessment. He is the author of Assessing Vocabulary (Cambridge, 2000) and is co-editor of Language Testing.
