The Use of Multiple-Choice Questions
in Language Testing
A paper assignment for
Language Testing and Evaluation
ENGL 6201
By: Ihsan Ibadurrahman (G1025429)
The Use of Multiple-Choice Questions in Language Testing
I. Introduction
The multiple-choice question (MCQ), also referred to as an objective test, is one of the test techniques in
language testing in which candidates are typically asked to respond by selecting the one correct answer from the
available options (Hughes, 2003). In designing MCQs, most test designers use Bloom’s taxonomy to check
whether specific instructional objectives have been met during the course (Samad, 2010). Bloom’s taxonomy is a
set of six hierarchically ordered levels of learning objectives proposed by Benjamin Bloom in 1956 (Wikipedia,
2012). This paper aims to illustrate how MCQs are created using Bloom’s taxonomy. Seven sample MCQs will be
presented, based on a reading passage entitled Smart US dog learns more than 1,000 words, written by Kerry
Sheridan (2011). A justification of how these MCQs were constructed will also be provided.
The MCQ is the most commonly used testing technique in academic achievement tests, yet it is also
the most heavily criticized (Lee, 2011). The second aim of this paper is therefore to elaborate on the advantages
and disadvantages of using MCQs in language testing by reviewing the currently available literature in the field.
A discussion and conclusion are given at the end of the paper.
II. A Sample of Multiple Choice Questions
Below is a sample of multiple-choice questions written using Bloom’s taxonomy, which orders instructional
objectives in six hierarchical levels: Knowledge, Comprehension, Application, Analysis, Synthesis, and
Evaluation. Knowledge refers to the ability to recall information from the text. Comprehension refers to the
ability to understand the text. Application is the ability to apply what is understood from the text to a new
situation. Analysis is the ability to compare and contrast different pieces of information. Synthesis is the ability
to create new information based on the given text. Lastly, evaluation is the ability to judge a piece of
information. Owing to the constraints of the text type, however, the questions could only be targeted up to the
fifth level of Bloom’s learning objectives: synthesis.
Following are questions related to the reading passage. For each question, choose the correct answer by marking A, B, C, or D on your answer sheet.
1. What does the passage mainly discuss?
A. A comparison between two different dogs: Chaser and Rico.
B. A study of how to increase dogs’ memory capability.
C. A report of a herding dog that learns a mass of vocabulary.
D. A report of how trainers can train their dogs differently.
2. After three years of training, Chaser stopped learning new words because ____
A. her memory capacity was at its limit.
B. her trainers seldom rewarded her with food anymore.
C. her trainers felt that it was enough.
D. her trainers thought that there was not enough time.
3. “By the time the pup was five months old, language training began in earnest.” (par. 9)
The underlined word, earnest, may be replaced with _____
A. a careful manner
B. a serious manner
C. a slow manner
D. a quick manner
4. Which of the following is NOT one of Chaser’s abilities?
A. distinguishing between different toys
B. recognizing objects from loads of toys
C. separating verbs from nouns
D. understanding the word order
5. Rico fails to distinguish the difference between ‘ball’ and ‘get-the-ball’ because he can’t ____
A. store any more vocabulary
B. recognize noun phrases
C. separate nouns from commands
D. respond to requests
6. “Hasan has got a Siberian husky dog called Maxi. She is 5 years old, and has never been language-
trained.” Which of the following statements is supported by the passage?
A. Learning words is not possible because Maxi is not a border collie.
B. Training language should have begun when Maxi was 5 months old.
C. Training Maxi is still possible, even without rewarding her with nice food.
D. None of the above statements are supported by the passage.
7. Study the types of human memory described in the text below.
“There are many types of memory. In general terms, memory can be divided into two: procedural
and declarative. Procedural memory is based on implicit learning, and is primarily employed in
learning motor skills. Declarative memory, on the other hand, requires conscious recall, in that some
conscious process must call back the information. Another type of memory is called visual memory
which is part of memory preserving some characteristics of our senses pertaining to visual
experience. One is able to place in memory information that resembles objects, places, animals or
people in sort of a mental image. Topographic memory is the ability to orient oneself in space, to
recognize and follow an itinerary, or to recognize familiar places. Getting lost when traveling alone is
an example of the failure of topographic memory.” (Wikipedia, 2012)
From the information above, it can be said that Chaser has an excellent ______ memory.
A. Procedural
B. Declarative
C. Visual
D. Topographic
III. Comments
Judging from the difficulty of the text, it is deemed most suitable for secondary school students. The
language, types, and difficulty of the questions outlined above were therefore designed accordingly. I shall use
the word students throughout the rest of this section to refer to any test-takers or candidates taking this
particular test. All seven multiple-choice questions above strictly follow the guidelines given by Samad (2010)
and Woodford and Bancroft (2005): the options/alternatives in each question were written to be balanced in
sentence length and similar in part of speech. As stated in the guidelines, the questions were ordered
chronologically so that students would not be cognitively overloaded. An elaboration of each question is given
below.
Item one is a synthesis question, as it attempts to test students’ ability to synthesize all the
information in the passage in order to decide on the best possible answer of the four. Although it may seem a
little early to jump into a higher-order cognitive skill such as synthesis, this question needs to be placed at the
beginning since it concerns the ‘title’, the beginning of a story. Assuming that the question is used in a real test
and not just for the sake of this paper, the title of the passage would be removed. The key to this question is
C. A report of a herding dog that learns a mass of vocabulary, which echoes the title itself – Smart US dog learns
more than 1,000 words. The distractors were written so that they offer students somewhat vague alternatives.
This is a case of “correct versus best answer”, as recommended by Tamir (1991, p. 188): instead of there being
one correct answer with everything else outright wrong, students need to select the answer that best
summarizes what the text is about. This clearly sets the first question at a higher cognitive level than the rest of
the questions.
Item two is a knowledge question, since it only requires students to recall information given in the
text in order to arrive at the key, D. her trainers thought that there was not enough time. However, to make
the answer less obvious, exact wording from the text was avoided. Item three clearly falls into the
comprehension category, as it tests students’ ability to work out the meaning of the underlined word from the
context. Item four is another knowledge question, in which students are asked to recall information from the
text in order to rule out the statement that is not true. It therefore seemed logically necessary to put the key in
the last position, to avoid the possibility of students eliminating the distractors in alphabetical order; if the key
were placed in the initial position as option A, students with the relevant knowledge would not read the rest of
the options.
The alternatives in item 5 were written in such a way that they avoid redundancy and unnecessary
confusion (Samad, 2010; Woodford & Bancroft, 2005). Redundancy would have been caused by repeating the
words he can’t in every alternative, while confusion could have arisen from a possible double negative in the
stem, as in “Rico can’t … because he can’t”. To avoid this confusion, the word fails was used in the stem. This
item is a comprehension question, since it tests students’ ability to retain specific information from the text as
well as to understand the meaning of discern (par. 17) and look for its closest meaning among the alternatives.
Higher-order thinking skills are targeted in the last two questions of the test. Item 6 is an application
question: students are asked to study a new situation and apply what they have learned from the text in order
to solve it. Another difficulty in this question lies in the use of none of the above as a distractor. Woodford and
Bancroft (2005) mention that it is commonly regarded as an effective option compared with its counterpart,
all of the above, since it requires students to eliminate all the other options before it can finally be chosen. The
last question, item 7, goes a level further up the taxonomy. An analysis-level question is achieved by asking
students to break down the crucial information in the new text and to compare and contrast the different
types of memory in order to arrive at the answer.
IV. Literature Review
The following is a brief overview of the literature on MCQs. This section is divided into two parts: arguments
that support the MCQ and arguments against it. Before going further, however, the four most relevant journal
articles are worth outlining. The first is a quantitative study by Cheng (2004), which compares three test
techniques in a listening test: a multiple-choice (MC) test, a multiple-choice cloze (MCC) test, and an open-ended
(OE) test. Among the one hundred and fifty-nine Mandarin Chinese technical college students who participate
in the study, participants score highest on the MCC test (mean = 35.33), followed by the MC test (mean = 33.84),
and lowest on the OE test (mean = 25.23). This suggests that listening performance is enhanced when a
selected-response format such as multiple choice is used.
A somewhat similar study is conducted by Ajideh and Rajab (2009), who compare the use of MCQs
and cloze tests in measuring vocabulary proficiency. Using 21 Iranian undergraduate EFL students as its sample,
the study reveals that, at a figure of .57, there is no significant correlation between the discrete-point item test
(MCQ) and the integrative cloze test of vocabulary. This is taken to indicate that those who perform well on the
cloze test might obtain a similar result if they took a discrete-point item test, and it suggests the possibility of
substituting the MCQ for the traditional cloze test in vocabulary testing.
Although the MCQ might have its place in vocabulary testing, its use in grammar testing is still in
question. Currie and Chiramanee (2010) compare responses to MCQ and constructed-response formats in a
grammar test. In their study, 152 Thai undergraduate students first take the grammar test in constructed-
response format and later take a parallel multiple-choice test that uses the students’ errors on the first test as
distractors. The study reveals that scores increase on the second test, with only 26% of the responses found to
be similar to those on the first. Guessing is assumed to be one of the factors behind the higher success rate on
the second test.
Although not directly related to language testing, McCoubrie (2004) seeks to bring to light the
unfairness for which the MCQ has generally been criticized, particularly in the area of medical assessment.
Fairness matters especially for medical students, since they will professionally need to think quickly on their
feet in solving patients’ problems, and the MCQ gives these students the chance to make life-and-death
decisions swiftly in a timed test. In his literature review, he concludes that a fair MCQ exam is one that (1) is
related to the syllabus, (2) integrates the testing of practical competence, (3) fairly represents all the important
material in the syllabus, (4) is error-free, (5) is criterion-referenced, (6) exercises caution when questions are
reused, and (7) uses the extended-matching question (EMQ) format for more reliable and time-efficient testing.
a. Arguments supporting multiple choice items
In the available literature, the MCQ is generally favored for tests that measure receptive skills. For
instance, Cheng’s (2004) study highlights the preference for MCQs over short-answer items in a listening test. In
that study, participants’ scores are higher on the MCQ format, mainly due to guessing, memory constraints, and
test-takers’ ability to predict what is coming before they listen. Another possible reason is that the MCQ strips
away the ambiguity found in most open-ended questions (Linn & Miller, 2005). The MCQ appeals for testing
receptive skills since it does not require the candidate to produce language (Samad, 2010). Hence, the MCQ
makes it possible to obtain accurate information on a candidate’s receptive skills and abilities, which would
otherwise be difficult to obtain.
The MCQ can be very useful for large-scale, high-stakes testing, where scoring thousands of essays
would seem practically impossible (Hughes, 2003; McCoubrie, 2004; Woodford & Bancroft, 2005). Considering
this benefit, Ajideh and Rajab (2009) suggest that the MCQ could substitute for the traditional cloze-test format
in vocabulary tests when the number of test-takers needs to be taken into consideration. Most large-scale tests
rely on the MCQ because of its practicality and reliability (Cheng, 2004; Lee, 2011). It is practical because scoring
can be done quickly using a computer scanner. Furthermore, with the increasing number of computer-based
MCQ tests nowadays, testing time has been reported to be significantly shorter (McCoubrie, 2004). The MCQ is
also reliable, since no judgment has to be made by the scorer of the test. High reliability is what sets the MCQ
apart from other test techniques. In scoring hundreds of essays, for instance, there is a concern that the last few
essays are not scored as reliably as the first few because of rater fatigue (Samad, 2010).
The MCQ also allows candidates who have the knowledge but are poor writers to excel, since all they
are asked to do is mark their chosen answers on the test paper (Tamir, 1991). Because the format is
time-efficient, testers can cover a wide range of topics by including more questions in the test than other test
techniques allow (McCoubrie, 2004). With computer software, testers can even build test item banks from
which tests can be produced instantly on demand or reused for future testing (Hoshino & Nakagawa, 2005).
b. Arguments against multiple choice items
As mentioned previously, the MCQ might be favorable for measuring listening skills. However, it might fall
short of accurately measuring language competence such as grammar (Currie & Chiramanee, 2010). Candidates
may be able to pick the right answer on an MCQ test, but being able to produce the item in written or spoken
language is an entirely different matter. Hughes (2003) also questions the validity of one part of the paper-based
TOEFL that aims to measure candidates’ writing ability yet merely asks them to identify grammatically incorrect
language items in MCQ format. Most MCQ tests thus fall short on validity for the sake of reliability.
A scathing criticism that has often been leveled at the MCQ is that there is always an element of
guessing that causes noise in language testing (Currie & Chiramanee, 2010: 487). In a four-option multiple-choice
item, for example, there is a 25% chance of getting the correct answer through guessing alone. In this sort of
guessing game, candidates may be able to pick the correct answer without having to read the passage or
understand its meaning (Lee, 2011: 31). Similarly, a correct answer can also be the result of eliminating incorrect
answers, without the candidate knowing the right answer in the first place (Woodford & Bancroft, 2005).
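As a rough worked illustration (not drawn from the reviewed articles), the size of this noise can be expressed
simply. On a test of n items with k options each, the expected number of items answered correctly by blind
guessing alone is

expected correct = n / k

so a hypothetical 40-item, four-option test would yield about 40 / 4 = 10 items correct by chance. The classical
correction-for-guessing formula from test theory attempts to remove this noise by penalizing wrong answers:

adjusted score = R - W / (k - 1)

where R is the number of right answers, W the number of wrong answers, and k the number of options per
item. Such formula scoring reduces the reward for blind guessing, but it is a scoring policy decision rather than a
property of the item format itself, and it does not address correct answers reached by partial elimination.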
Another commonly cited disadvantage of the MCQ is the difficulty of writing one (Burton et al.,
1991; Linn & Miller, 2005). Hughes (2003) explains that MCQs are very difficult to write: many careful steps need
to be taken to produce a good item, which must be well written, pretested, and statistically analyzed. Schutz et
al. (2008) assert that coming up with good distractors (i.e. those that could plausibly be chosen by many
candidates) contributes to the difficulty of writing MCQs. A great deal of effort and time therefore needs to go
into the construction of an MCQ test.
There are many other disadvantages to using the MCQ in a test. Hughes (2003) mentions some of
these shortcomings:
1. It measures only recognition knowledge: the MCQ cannot directly measure the gap between a
candidate’s productive and receptive skills. For instance, a candidate who gets the correct
answer in a grammar test may well be unable to use the grammar item correctly in speaking or
writing. In other words, it gives only half the picture of someone’s language ability – the
knowledge, not the use.
2. It limits what can be tested: the MCQ is restricted to language items for which a range of
plausible distractors is available. For example, while it is possible to test a candidate’s knowledge
of English prepositions by using many different prepositions as distractors, it would be difficult to
find distractors for a question on the difference between the present perfect and the past simple.
This may lead to MCQs with poor distractors, which in turn may give away the correct answer
without the candidate even having to guess. The MCQ also lends itself very easily to items
targeting only the lower-level thinking skills of Bloom’s taxonomy (Paxton, 2000; Vleuten, 1999,
as cited in McCoubrie, 2004), which in turn encourages students’ rote learning (Roediger & Marsh,
2005; Tamir, 1991).
3. It may create a harmful backwash effect: high-stakes testing that relies heavily on the MCQ may
have a negative effect on teaching and learning in class. Teachers may concentrate more on exam
preparation, such as training students to answer MCQs with educated guesses, than on actually
improving students’ language.
4. It may facilitate cheating: it would be very easy for students to pass answers to others using
non-verbal signals such as body gestures.
V. Discussion and Conclusion
From the reviewed articles, it can be inferred that the MCQ has both strengths and weaknesses. Its
benefits lie in its high reliability, accuracy for receptive skills, practicality, quick scoring, and the ability to
include many question items. Its shortcomings include harmful backwash, facilitation of guesswork and
cheating, questionable validity, limits on what can be tested, and the exclusion of higher-order thinking skills.
Just as there is no universally correct or best ‘methodology’ in language teaching (Brown, 2001), in language
testing we may never have one best ‘test technique’. The pros and cons weighed so far mean that testers have
to choose a different test technique for each given situation. Thus, as pointed out by Ory (n.d., as cited in
Samad, 2010: 32), the MCQ may be most appropriate when:
1. A large number of students is involved
2. The test questions are to be reused
3. The scores need to be produced quickly
4. A wide coverage of content needs to be included
5. Lower-order thinking skills are of greater importance
Having had the experience of writing the MCQs outlined earlier in this paper, the writer has to agree with the
last point above: items targeting higher learning objectives such as application, synthesis, and evaluation are
possible but can be difficult to write. This is probably what leads to the situation in which higher learning
objectives are deemed necessary but are omitted because of the difficulty of writing such items (Paxton, 2000).
As mentioned earlier, many MCQ tests are produced poorly because of the difficulty the format presents and,
sadly, because very little care and attention is given to them. This is clearly something to be avoided: the MCQ
is not a shortcut to easy and cheap testing. That is to say, although the MCQ still has its potential in language
testing, all ‘excessive, indiscriminate, and potentially harmful use of technique’ should be avoided (Hughes,
2003: 78).
References
Ajideh, P. & Esfandiari, R. (2009). ‘A Close Look at the Relationship between Multiple Choice Vocabulary
Test and Integrative Cloze Test of Lexical Words in Iranian Context’, English Language Teaching,
Vol. 2(3), pp. 163-170.
Bloom's Taxonomy. (2012, March 9). In Wikipedia, The Free Encyclopedia. Retrieved March 10, 2012,
from http://en.wikipedia.org/w/index.php?title=Bloom%27s_Taxonomy&oldid=481064754
Brown, H. (2001). Teaching by Principles, 2nd edn., New York: Pearson Education.
Burton, S., Sudweeks, R., Merrill, P. & Wood, B. (1991). How to Prepare Better Multiple-Choice Test
Items: Guidelines for University Faculty. Retrieved from:
http://testing.byu.edu/info/handbooks/betteritems.pdf
Cheng, H. (2004). ‘A comparison of Multiple-Choice and Open-Ended Response Formats for the
Assessment of Listening Performance’, Foreign Language Annals, Vol. 37(4), pp. 544-555.
Currie, M. & Chiramanee, T. (2010). ‘The effect of the multiple-choice item format on the measurement
of knowledge of language structure’, Language Testing, Vol. 27, pp. 471-491.
Hoshino, A. & Nakagawa, H. (2005). ‘A real-time multiple-choice question generation for language
testing’, Proceedings of the 2nd Workshop on Building Educational Applications Using NLP,
pp. 17-20.
Hughes, A. (2003). Testing for Language Teachers (2nd Ed), Cambridge: Cambridge University Press.
Lee, J. (2011). Second Language Reading Topic Familiarity and Test Score: Test-Taking Strategies for
Multiple-Choice Comprehension Questions (PhD. Thesis). Retrieved from ProQuest Dissertations
and Theses. (Accession Order No. 3494064).
Linn, R. & Miller, M. (2005). Measurement and Assessment in Teaching (9th Ed). New Jersey: Pearson
Education.
McCoubrie, P. (2004). ‘Improving the fairness of multiple-choice questions: a literature review’, Medical
Teacher, Vol. 26(8), pp. 709-712.
Memory. (2012, March 8). In Wikipedia, The Free Encyclopedia. Retrieved March 10, 2012, from
http://en.wikipedia.org/w/index.php?title=Memory&oldid=480781062
Paxton, M. (2000). ‘A linguistic perspective on multiple choice questioning’, Assessment and Evaluation
in Higher Education, Vol. 25(2), pp.109-119.
Roediger, H. & Marsh, E. (2005). ‘The Positive and Negative Consequences of Multiple-Choice Testing’,
Journal of Experimental Psychology: Learning, Memory, and Cognition, Vol. 31(5), pp. 1155-1159.
Samad, A. (2010). Essentials of Language Testing for Malaysian Teachers. Selangor: Universiti Putra
Malaysia Press.
Schutz, L., Rivers, K., Schutz, J. & Proctor, A. (2008). ‘Preventing Multiple-Choice Tests From Impeding
Educational Advancement After Acquired Brain Injury’, Language, Speech & Hearing Services in
Schools, Vol. 39(1), pp. 104-109.
Sheridan, K. (2011, January 7). ‘Smart US dog learns more than 1,000 words’. In MNN, Mother Nature
Network. Retrieved March 10, 2012, from http://www.mnn.com/family/pets/stories/smart-dog-
learns-more-than-1000-words
Tamir, P. (1991). ‘Multiple Choice Items: How to Gain the Most Out of Them’, Biochemical Education,
Vol. 19(4), pp. 188-192.
Woodford, K. & Bancroft, P. (2005). ‘Multiple Choice Questions Not Considered Harmful’, Australian
Computing Education Conference 2005 – Research and Practice in Information Technology, Vol. 42,
pp. 1-8.