The Use of Multiple-Choice Questions
in Language Testing
A paper assignment for
Language Testing and Evaluation
ENGL 6201
By: Ihsan Ibadurrahman (G1025429)
The Use of Multiple-Choice Questions in Language Testing
I. Introduction
The multiple-choice question (MCQ), also referred to as an objective test, is one of the test techniques in
language testing in which candidates are typically asked to respond by selecting the one correct answer from the
available options (Hughes, 2003). In designing MCQs, most test designers use Bloom’s taxonomy to check
whether specific instructional objectives have been met during the course (Samad, 2010). Bloom’s taxonomy is a
set of six hierarchically ordered levels of learning objectives proposed by Benjamin Bloom in 1956 (Wikipedia,
2012). This paper aims to illustrate how MCQs are created using Bloom’s taxonomy. Seven sample MCQs will be
presented, based on a reading passage entitled Smart US dog learns more than 1,000 words, written by Kerry
Sheridan (2011). A justification of how these MCQs were constructed will also be provided.
The MCQ is the most commonly used testing technique in academic achievement tests, yet it is also
the most heavily criticized (Lee, 2011). The second aim of this paper is therefore to elaborate on the advantages
and disadvantages of using MCQs in language testing by reviewing the currently available literature in the field.
A discussion and conclusion are given at the end of the paper.
II. A Sample of Multiple Choice Questions
Below is a sample of multiple-choice questions written using Bloom’s taxonomy, which orders instructional
objectives in six hierarchical levels: Knowledge, Comprehension, Application, Analysis, Synthesis, and
Evaluation. Knowledge refers to the ability to recall information from the text. Comprehension refers to the
ability to understand the text. Application is the ability to apply what is understood from the text to a new
situation. Analysis is the ability to compare and contrast different pieces of information. Synthesis is the ability
to create new information based on the given text. Lastly, evaluation is the ability to judge a piece of
information. Owing to the constraints of the text type, however, the questions could only be targeted up to the
fifth level of Bloom’s learning objectives: synthesis.
Following are questions related to the reading passage. For each question, choose the correct answer by marking A, B, C, or D on your answer sheet.
1. What does the passage mainly discuss?
A. A comparison between two different dogs: Chaser and Rico.
B. A study of how to increase dogs’ memory capability.
C. A report of a herding dog that learns a mass of vocabulary.
D. A report of how trainers can train their dogs differently.
2. After three years of training, Chaser stopped learning new words because ____
A. her memory capacity was at its limit.
B. her trainers seldom rewarded her with food anymore.
C. her trainers felt that it was enough.
D. her trainers thought that there was not enough time.
3. “By the time the pup was five months old, language training began in earnest.” (par. 9)
The underlined word, earnest, may be replaced with _____
A. a careful manner
B. a serious manner
C. a slow manner
D. a quick manner
4. Which of the following is NOT one of Chaser’s abilities?
A. distinguishing between different toys
B. recognizing objects from loads of toys
C. separating verbs from nouns
D. understanding the word order
5. Rico fails to distinguish the difference between ‘ball’ and ‘get-the-ball’ because he can’t ____
A. store any more vocabulary
B. recognize noun phrases
C. separate nouns from commands
D. respond to requests
6. “Hasan has got a Siberian husky dog called Maxi. She is 5 years old, and has never been language-
trained.” Which of the following statements is supported by the passage?
A. Learning words is not possible because Maxi is not a border collie.
B. Training language should have begun when Maxi was 5 months old.
C. Training Maxi is still possible, even without rewarding her with nice food.
D. None of the above statements are supported by the passage.
7. Study the types of human memory described in the text below.
“There are many types of memory. In general terms, memory can be divided into two: procedural
and declarative. Procedural memory is based on implicit learning, and is primarily employed in
learning motor skills. Declarative memory, on the other hand, requires conscious recall, in that some
conscious process must call back the information. Another type of memory is called visual memory
which is part of memory preserving some characteristics of our senses pertaining to visual
experience. One is able to place in memory information that resembles objects, places, animals or
people in sort of a mental image. Topographic memory is the ability to orient oneself in space, to
recognize and follow an itinerary, or to recognize familiar places. Getting lost when traveling alone is
an example of the failure of topographic memory.” (Wikipedia, 2012)
From the information above, it can be said that Chaser has an excellent ______ memory.
A. Procedural
B. Declarative
C. Visual
D. Topographic
III. Comments
Judging from the difficulty of the text, it is deemed most suitable for secondary school students. The
language, types, and difficulty of the questions outlined above were therefore designed accordingly. I shall use
the word students throughout the rest of this section to refer to any test-takers or candidates taking this
particular test. All seven multiple-choice questions above strictly follow the guidelines given by Samad (2010)
and Woodford and Bancroft (2005): the options/alternatives in each question were written to be balanced in
sentence length and similar in part of speech. As stated in the guidelines, the questions were ordered
chronologically so that students would not be cognitively overloaded. An elaboration of each question is given
below.
Item one is a synthesis question, as it attempts to test students’ ability to synthesize all the
information in the passage in order to decide on the best possible answer of the four. Although it may seem a
little early to jump into a higher-order cognitive skill such as synthesis, this question needs to be placed at the
beginning since it concerns the ‘title’, the beginning of a story. Assuming that the question is used in a real test
and not just for the sake of this paper, the title of the passage would be removed. The key to this question is
C. A report of a herding dog that learns a mass of vocabulary, which echoes the title itself – Smart US dog learns
more than 1,000 words. The distractors were written so that they offer students somewhat vague alternatives.
This is a case of “correct versus best answer”, as recommended by Tamir (1991, p. 188): instead of there being
one correct answer with everything else outright wrong, students need to select the answer that best
summarizes what the text is about. This clearly sets the first question at a higher cognitive level than the rest of
the questions.
Item two is a knowledge question, since it only requires students to recall information given in the
text in order to arrive at the key, D. her trainers thought that there was not enough time. However, to make
the answer less obvious, exact wording from the text was avoided. Item three clearly falls into the
comprehension category, as it tests students’ ability to work out the meaning of the underlined word from the
context. Item four is another knowledge question, in which students are asked to recall information from the
text in order to rule out the statement that is not true. It therefore seemed logically necessary to put the key in
the last position, to avoid the possibility of students eliminating the distractors in alphabetical order; if the key
were placed in the initial position as option A, students with the relevant knowledge would not read the rest of
the options.
The alternatives in item 5 were written in such a way that they avoid redundancy and unnecessary
confusion (Samad, 2010; Woodford & Bancroft, 2005). Redundancy would have been caused by repeating the
words he can’t in every alternative, while confusion could have arisen from a possible double negative in the
stem, as in “Rico can’t … because he can’t”. To avoid this confusion, the word fails was used in the stem. This
item is a comprehension question, since it tests students’ ability to retain specific information from the text as
well as to understand the meaning of discern (par. 17) and look for its closest meaning among the alternatives.
Higher-order thinking skills are targeted in the last two questions of the test. Item 6 is an application
question: students are asked to study a new situation and apply what they have learned from the text in order
to solve it. Another difficulty in this question lies in the use of none of the above as a distractor. Woodford and
Bancroft (2005) mention that it is commonly regarded as an effective option compared with its counterpart,
all of the above, since it requires students to eliminate all the other options before it can finally be chosen. The
last question, item 7, goes a level further up the taxonomy. An analysis-level question is achieved by asking
students to break down the crucial information in the new text and to compare and contrast the different
types of memory in order to arrive at the answer.
IV. Literature Review
The following is a brief overview of the literature on MCQs. This section is divided into two parts: arguments
that support the MCQ and arguments against it. Before going further, however, the four most relevant journal
articles are worth outlining. The first is a quantitative study by Cheng (2004), which compares three test
techniques in a listening test: a multiple-choice (MC) test, a multiple-choice cloze (MCC) test, and an open-ended
(OE) test. Among the one hundred and fifty-nine Mandarin Chinese technical college students who participate
in the study, participants score highest on the MCC test (mean = 35.33), followed by the MC test (mean = 33.84),
and lowest on the OE test (mean = 25.23). This suggests that listening performance is enhanced when a
selected-response format such as multiple choice is used.
A somewhat similar study is conducted by Ajideh and Rajab (2009), who compare the use of MCQs
and cloze tests in measuring vocabulary proficiency. Using 21 Iranian undergraduate EFL students as its sample,
the study reveals that, at a figure of .57, there is no significant correlation between the discrete-point item test
(MCQ) and the integrative cloze test of vocabulary. This is taken to indicate that those who perform well on the
cloze test might obtain a similar result if they took a discrete-point item test, and it suggests the possibility of
substituting the MCQ for the traditional cloze test in vocabulary testing.
Although the MCQ might have its place in vocabulary testing, its use in grammar testing is still in
question. Currie and Chiramanee (2010) compare responses to MCQ and constructed-response formats in a
grammar test. In their study, 152 Thai undergraduate students first take the grammar test in constructed-
response format and later take a parallel multiple-choice test that uses the students’ errors on the first test as
distractors. The study reveals that scores increase on the second test, with only 26% of the responses found to
be similar to those on the first. Guessing is assumed to be one of the factors behind the higher success rate on
the second test.
Although not directly related to language testing, McCoubrie (2004) seeks to bring to light the
unfairness for which the MCQ has generally been criticized, particularly in the area of medical assessment.
Fairness matters especially for medical students, since they will professionally need to think quickly on their
feet in solving patients’ problems, and the MCQ gives these students the chance to make life-and-death
decisions swiftly in a timed test. In his literature review, he concludes that a fair MCQ exam is one that (1) is
related to the syllabus, (2) integrates the testing of practical competence, (3) fairly represents all the important
material in the syllabus, (4) is error-free, (5) is criterion-referenced, (6) exercises caution when questions are
reused, and (7) uses the extended-matching question (EMQ) format for more reliable and time-efficient testing.
a. Arguments supporting multiple choice items
In the available literature, the MCQ is generally favored for tests that measure receptive skills. For
instance, Cheng’s (2004) study highlights the preference for MCQs over short-answer items in a listening test. In
that study, participants’ scores are higher on the MCQ format, mainly due to guessing, memory constraints, and
test-takers’ ability to predict what is coming before they listen. Another possible reason is that the MCQ strips
away the ambiguity found in most open-ended questions (Linn & Miller, 2005). The MCQ appeals for testing
receptive skills since it does not require the candidate to produce language (Samad, 2010). Hence, the MCQ
makes it possible to obtain accurate information on a candidate’s receptive skills and abilities, which would
otherwise be difficult to obtain.
The MCQ can be very useful for large-scale, high-stakes testing, where scoring thousands of essays
would seem practically impossible (Hughes, 2003; McCoubrie, 2004; Woodford & Bancroft, 2005). Considering
this benefit, Ajideh and Rajab (2009) suggest that the MCQ could substitute for the traditional cloze-test format
in vocabulary tests when the number of test-takers needs to be taken into consideration. Most large-scale tests
rely on the MCQ because of its practicality and reliability (Cheng, 2004; Lee, 2011). It is practical because scoring
can be done quickly using a computer scanner. Furthermore, with the increasing number of computer-based
MCQ tests nowadays, testing time has been reported to be significantly shorter (McCoubrie, 2004). The MCQ is
also reliable, since no judgment has to be made by the scorer of the test. High reliability is what sets the MCQ
apart from other test techniques. In scoring hundreds of essays, for instance, there is a concern that the last few
essays are not scored as reliably as the first few because of rater fatigue (Samad, 2010).
The MCQ also allows candidates who have the knowledge but are poor writers to excel, since all they
are asked to do is mark their chosen answers on the test paper (Tamir, 1991). Because the format is
time-efficient, testers can cover a wide range of topics by including more questions in the test than other test
techniques allow (McCoubrie, 2004). With computer software, testers can even build test item banks from
which tests can be produced instantly on demand or reused for future testing (Hoshino & Nakagawa, 2005).
b. Arguments against multiple choice items
As mentioned previously, the MCQ might be favorable for measuring listening skills. However, it might fall
short of accurately measuring language competence such as grammar (Currie & Chiramanee, 2010). Candidates
may be able to pick the right answer on an MCQ test, but being able to produce the item in written or spoken
language is an entirely different matter. Hughes (2003) also questions the validity of one part of the paper-based
TOEFL that aims to measure candidates’ writing ability yet merely asks them to identify grammatically incorrect
language items in MCQ format. Most MCQ tests thus fall short on validity for the sake of reliability.
A scathing criticism that has often been leveled at the MCQ is that there is always an element of
guessing that causes noise in language testing (Currie & Chiramanee, 2010: 487). In a four-option multiple-choice
item, for example, there is a 25% chance of getting the correct answer through guessing alone. In this sort of
guessing game, candidates may be able to pick the correct answer without having to read the passage or
understand its meaning (Lee, 2011: 31). Similarly, a correct answer can also be the result of eliminating incorrect
answers, without the candidate knowing the right answer in the first place (Woodford & Bancroft, 2005).
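As a rough worked illustration (not drawn from the reviewed articles), the size of this noise can be expressed
simply. On a test of n items with k options each, the expected number of items answered correctly by blind
guessing alone is

expected correct = n / k

so a hypothetical 40-item, four-option test would yield about 40 / 4 = 10 items correct by chance. The classical
correction-for-guessing formula from test theory attempts to remove this noise by penalizing wrong answers:

adjusted score = R - W / (k - 1)

where R is the number of right answers, W the number of wrong answers, and k the number of options per
item. Such formula scoring reduces the reward for blind guessing, but it is a scoring policy decision rather than a
property of the item format itself, and it does not address correct answers reached by partial elimination.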
Another commonly cited disadvantage of the MCQ is the difficulty of writing one (Burton et al.,
1991; Linn & Miller, 2005). Hughes (2003) explains that MCQs are very difficult to write: many careful steps need
to be taken to produce a good item, which must be well written, pretested, and statistically analyzed. Schutz et
al. (2008) assert that coming up with good distractors (i.e. those that could plausibly be chosen by many
candidates) contributes to the difficulty of writing MCQs. A great deal of effort and time therefore needs to go
into the construction of an MCQ test.
There are many other disadvantages to using the MCQ in a test. Hughes (2003) mentions some of
these shortcomings:
1. It measures only recognition knowledge: the MCQ cannot directly measure the gap between a
candidate’s productive and receptive skills. For instance, a candidate who gets the correct
answer in a grammar test may well be unable to use the grammar item correctly in speaking or
writing. In other words, it gives only half the picture of someone’s language ability – the
knowledge, not the use.
2. It limits what can be tested: the MCQ is restricted to language items for which a range of
plausible distractors is available. For example, while it is possible to test a candidate’s knowledge
of English prepositions by using many different prepositions as distractors, it would be difficult to
find distractors for a question on the difference between the present perfect and the past simple.
This may lead to MCQs with poor distractors, which in turn may give away the correct answer
without the candidate even having to guess. The MCQ also lends itself very easily to items
targeting only the lower-level thinking skills of Bloom’s taxonomy (Paxton, 2000; Vleuten, 1999,
as cited in McCoubrie, 2004), which in turn encourages students’ rote learning (Roediger & Marsh,
2005; Tamir, 1991).
3. It may create a harmful backwash effect: high-stakes testing that relies heavily on the MCQ may
have a negative effect on teaching and learning in class. Teachers may concentrate more on exam
preparation, such as training students to answer MCQs with educated guesses, than on actually
improving students’ language.
4. It may facilitate cheating: it would be very easy for students to pass answers to others using
non-verbal signals such as body gestures.
V. Discussion and Conclusion
From the reviewed articles, it can be inferred that the MCQ has both strengths and weaknesses. Its
benefits lie in its high reliability, accuracy for receptive skills, practicality, quick scoring, and the ability to
include many question items. Its shortcomings include harmful backwash, facilitation of guesswork and
cheating, questionable validity, limits on what can be tested, and the exclusion of higher-order thinking skills.
Just as there is no universally correct or best ‘methodology’ in language teaching (Brown, 2001), in language
testing we may never have one best ‘test technique’. The pros and cons weighed so far mean that testers have
to choose a different test technique for each given situation. Thus, as pointed out by Ory (n.d., as cited in
Samad, 2010: 32), the MCQ may be most appropriate when:
1. A large number of students is involved
2. The test questions are to be reused
3. The scores need to be produced quickly
4. A wide coverage of content needs to be included
5. Lower-order thinking skills are of greater importance
Having had the experience of writing the MCQs outlined earlier in this paper, the writer has to agree with the
last point above: items targeting higher learning objectives such as application, synthesis, and evaluation are
possible but can be difficult to write. This is probably what leads to the situation in which higher learning
objectives are deemed necessary but are omitted because of the difficulty of writing such items (Paxton, 2000).
As mentioned earlier, many MCQ tests are produced poorly because of the difficulty the format presents and,
sadly, because very little care and attention is given to them. This is clearly something to be avoided: the MCQ
is not a shortcut to easy and cheap testing. That is to say, although the MCQ still has its potential in language
testing, all ‘excessive, indiscriminate, and potentially harmful use of technique’ should be avoided (Hughes,
2003: 78).
References
Ajideh, P. & Esfandiari, R. (2009). ‘A Close Look at the Relationship between Multiple Choice Vocabulary
Test and Integrative Cloze Test of Lexical Words in Iranian Context’, English Language Teaching,
Vol. 2(3), pp. 163-170.
Bloom's Taxonomy. (2012, March 9). In Wikipedia, The Free Encyclopedia. Retrieved March 10, 2012,
from http://en.wikipedia.org/w/index.php?title=Bloom%27s_Taxonomy&oldid=481064754
Brown, H. (2001). Teaching by Principles, 2nd edn., New York: Pearson Education.
Burton, S., Sudweeks, R., Merrill, P. & Wood, B. (1991). How to Prepare Better Multiple-Choice Test
Items: Guidelines for University Faculty. Retrieved from:
http://testing.byu.edu/info/handbooks/betteritems.pdf
Cheng, H. (2004). ‘A comparison of Multiple-Choice and Open-Ended Response Formats for the
Assessment of Listening Performance’, Foreign Language Annals, Vol. 37(4), pp. 544-555.
Currie, M. & Chiramanee, T. (2010). ‘The effect of the multiple-choice item format on the measurement
of knowledge of language structure’, Language Testing, Vol. 27, pp. 471-491.
Hoshino, A. & Nakagawa, H. (2005). ‘A real-time multiple-choice question generation for language
testing’, Proceedings of the 2nd Workshop on Building Educational Applications Using NLP,
pp. 17-20.
Hughes, A. (2003). Testing for Language Teachers (2nd Ed), Cambridge: Cambridge University Press.
Lee, J. (2011). Second Language Reading Topic Familiarity and Test Score: Test-Taking Strategies for
Multiple-Choice Comprehension Questions (PhD. Thesis). Retrieved from ProQuest Dissertations
and Theses. (Accession Order No. 3494064).
Linn, R. & Miller, M. (2005). Measurement and Assessment in Teaching (9th Ed). New Jersey: Pearson
Education.
McCoubrie, P. (2004). ‘Improving the fairness of multiple-choice questions: a literature review’, Medical
Teacher, Vol. 26(8), pp. 709-712.
Memory. (2012, March 8). In Wikipedia, The Free Encyclopedia. Retrieved March 10, 2012, from
http://en.wikipedia.org/w/index.php?title=Memory&oldid=480781062
Paxton, M. (2000). ‘A linguistic perspective on multiple choice questioning’, Assessment and Evaluation
in Higher Education, Vol. 25(2), pp.109-119.
Roediger, H. & Marsh, E. (2005). ‘The Positive and Negative Consequences of Multiple-Choice Testing’,
Journal of Experimental Psychology: Learning, Memory, and Cognition, Vol. 31(5), pp. 1155-1159.
Samad, A. (2010). Essentials of Language Testing for Malaysian Teachers. Selangor: Universiti Putra
Malaysia Press.
Schutz, L., Rivers, K., Schutz, J. & Proctor, A. (2008). ‘Preventing Multiple-Choice Tests From Impeding
Educational Advancement After Acquired Brain Injury’, Language, Speech & Hearing Services in
Schools, Vol. 39(1), pp. 104-109.
Sheridan, K. (2011, January 7). ‘Smart US dog learns more than 1,000 words’. In MNN, Mother Nature
Network. Retrieved March 10, 2012, from http://www.mnn.com/family/pets/stories/smart-dog-
learns-more-than-1000-words
Tamir, P. (1991). ‘Multiple Choice Items: How to Gain the Most Out of Them’, Biochemical Education,
Vol. 19(4), pp. 188-192.
Woodford, K. & Bancroft, P. (2005). ‘Multiple Choice Questions Not Considered Harmful’, Australian
Computing Education Conference 2005 – Research and Practice in Information Technology, Vol. 42,
pp. 1-8.