Designing a Classroom Test

Anthony Paolo, PhD
Director of Assessment & Evaluation, Office of Medical Education
& Psychometrician for CTC Teaching & Learning Technologies

September 2008
Content
• Purpose of classroom test
• Test blueprint & specifications
• Item writing
• Assembling the test
• Item analysis
Purpose of Classroom Test
• Establish basis for assigning grades
• Determine how well each student has achieved course objectives
• Diagnose student problems
• Identify areas where instruction needs improvement
• Motivate students to study
• Communicate what material is important
Test Blueprint
• To ensure the test assesses what you want to measure
• To ensure the test assesses the level or depth of learning you want to measure
Bloom’s Revised Cognitive Taxonomy
• Remembering & Understanding
  – Remembering: Retrieving, recognizing, recalling relevant knowledge.
  – Understanding: Constructing meaning from information through interpreting, classifying, summarizing, inferring, explaining.
  • ITEM TYPES: MC, T/F, Matching, Short Answer
• Applying & Analyzing
  – Applying: Implementing a procedure or process.
  – Analyzing: Breaking material into constituent parts, determining how the parts relate to one another and to an overall structure or purpose through differentiating, organizing, and attributing.
  • ITEM TYPES: MC, Short Answer, Problems, Essay
• Evaluating & Creating
  – Evaluating: Making judgments based on criteria & standards through checking and critiquing.
  – Creating: Putting elements together to form a coherent or functional whole; reorganizing elements into a new pattern or structure through generating, planning, or producing.
  • ITEM TYPES: MC, Essay
Test Blueprint

Learning Level (number of test items)

Content/Objective   Knows facts (Recall)   Understanding   Applies Principles (Application)   Total
Krebs Cycle                    3                  5                       2                     10
Aquaporins                     2                  2                       6                     10
Cell Types                     5                  0                       0                      5
Total                         10                  7                       8                     25
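Treating the blueprint as data makes it easy to sanity-check. Below is a minimal Python sketch (the dictionary layout and variable names are illustrative, not from the talk) that verifies the row and column totals against the planned test length:

```python
# Blueprint as {topic: {learning level: number of items}},
# mirroring the table above.
blueprint = {
    "Krebs Cycle": {"Recall": 3, "Understanding": 5, "Application": 2},
    "Aquaporins":  {"Recall": 2, "Understanding": 2, "Application": 6},
    "Cell Types":  {"Recall": 5, "Understanding": 0, "Application": 0},
}
planned_total = 25

# Items per topic (row totals) and per learning level (column totals).
row_totals = {topic: sum(levels.values()) for topic, levels in blueprint.items()}
col_totals = {}
for levels in blueprint.values():
    for level, n in levels.items():
        col_totals[level] = col_totals.get(level, 0) + n

assert sum(row_totals.values()) == planned_total
print(row_totals)  # {'Krebs Cycle': 10, 'Aquaporins': 10, 'Cell Types': 5}
print(col_totals)  # {'Recall': 10, 'Understanding': 7, 'Application': 8}
```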
Test Specifications
• To ensure the test covers the content and/or objectives in the proper proportions
Test Specifications

Topics        Time spent on Topic   % of total class time   Number (%) of test items
Krebs Cycle          10 hrs                  40%                   10 (40%)
Aquaporins           10 hrs                  40%                   10 (40%)
Cell Types            5 hrs                  20%                    5 (20%)
Total                25 hrs                 100%                   25 (100%)
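The specifications table applies a simple proportional rule: each topic receives a share of items equal to its share of class time. A minimal sketch of that arithmetic (topic hours and the 25-item total are taken from the table above):

```python
# Allocate test items in proportion to class time spent on each topic.
hours = {"Krebs Cycle": 10, "Aquaporins": 10, "Cell Types": 5}
total_items = 25

total_hours = sum(hours.values())
allocation = {topic: round(total_items * h / total_hours)
              for topic, h in hours.items()}
print(allocation)  # {'Krebs Cycle': 10, 'Aquaporins': 10, 'Cell Types': 5}
```

With less convenient numbers, round() may leave the allocations summing to one more or one fewer than the planned total; the remainder has to be assigned by judgment.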
Item Writing – General Guidelines (1)
• Present a single clearly defined problem that is based on a significant concept rather than trivial or esoteric ideas
• Use simple, precise & unambiguous wording
• Exclude extraneous or irrelevant information
• Eliminate any systematic pattern of answers that may allow guessing correctly
Item Writing – General Guidelines (2)
• Avoid cultural, racial, ethnic & sexual bias
• Avoid presupposed knowledge that favors one group over another (“fly ball” favors those who know baseball)
• Refrain from providing unnecessary clues to the correct answer
• Avoid negatively phrased items (e.g., except, not)
• Arrange answers in alphabetical / numerical order
Item Writing – General Guidelines (3)
• Avoid “None of the above” or “All of the above” type answers
• Avoid “Both A & B” or “Neither A nor B” type answers
Item Writing – Correct Answer Is
• Longer
• More qualified or more general
• Uses familiar phraseology
• Grammatically correct for the item stem
• 1 of the 2 similar statements
• 1 of the 2 opposite statements
Item Writing – Wrong Answer Is
• Usually the first or last option
• Contains extreme words (always, never, nonsense, etc.)
• Contains unexpected language or technical terms
• Contains flippant remarks or completely unreasonable statements
Item Writing – Clue Types
• Grammatical cues
• Logical cues
• Absolute terms
• Word repeats
• Vague terms
Item Writing
• Effective test items match the desired depth of learning as directly as possible

Applying & Analyzing
• Applying: Implementing a procedure or process.
• Analyzing: Breaking material into constituent parts, determining how the parts relate to one another and to an overall structure or purpose through differentiating, organizing, and attributing.
  – ITEM TYPES: MC, Short Answer, Problems, Essay
Comparison of MC & Essay (1)
• Depth of learning
  – Essay: Can measure application and more complex outcomes. Poor for recall.
  – MC: Can be designed to measure application and more complex outcomes as well as recall.
• Item prep
  – Essay: Fewer test items, less prep time.
  – MC: Relatively large number of items, more prep time.
• Content sampling
  – Essay: Limited, few items.
  – MC: Broader content sampling.
Comparison of MC & Essay (2)
• Encouragement
  – Essay: Encourages organization, integration & effective expression of ideas.
  – MC: Encourages development of broad background of knowledge & abilities.
• Scoring
  – Essay: Time consuming, requires special measures for consistent results.
  – MC: Easy to score with consistent results.
Item Writing – Application
• MC application-of-knowledge items tend to have long vignettes that require decisions.
• Case et al. at the NBME investigated the impact of increasing levels of interpretation, analysis and synthesis required to answer a question on item performance (Academic Medicine, 1996;71:528-530).
Preparing & Assembling the Test
• Provide general directions
  – Time allowed (allow enough time to complete test)
  – How items are scored
  – How to record answers
  – How to record name/ID
• Arrange items systematically
• Provide adequate space for short answer and essay responses
• Placement of easier & harder items
Interpreting test scores
• Teachers
  – High scores = good instruction
  – Low scores = poor students
• Students
  – High scores = smart, well-prepared
  – Low scores = poor teaching, bad test
Interpreting test scores
• High scores: too easy, only measured simple educational objectives, biased scoring, cheating, unintentional clues to right answers
• Low scores: too hard, tricky questions, content not covered in class, grader bias, insufficient time to complete test
Item Analysis
• Main purpose of item analysis is to improve the test
• Analyze items to identify:
  – Potential mistakes in scoring
  – Ambiguous/tricky items
  – Alternatives that do not work well
  – Problems with time limits
Reliability
• The reliability of a test refers to the extent to which it is likely to produce consistent results
• Common approaches: Test-Retest, Split-Half, Internal consistency
• Reliability coefficients range from 0 (no reliability) to 1 (perfect reliability)
• Internal consistency is usually measured by Kuder-Richardson 20 (KR-20) or Cronbach’s coefficient alpha (see the sketch below)
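For a dichotomously scored test, KR-20 is (k / (k - 1)) * (1 - Σ p·q / σ²), where k is the number of items, p is the proportion correct on each item (q = 1 - p), and σ² is the variance of total scores. A minimal sketch, assuming a students-by-items matrix of 0/1 scores (toy data, not from the talk):

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """KR-20 internal consistency for a students x items 0/1 matrix."""
    k = scores.shape[1]                          # number of items
    p = scores.mean(axis=0)                      # proportion correct per item
    item_var = (p * (1 - p)).sum()               # sum of item variances (p * q)
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Toy data: 5 students x 4 items (1 = correct, 0 = wrong).
X = np.array([[1, 1, 1, 1],
              [1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 0]])
print(round(kr20(X), 2))  # 0.91 for this strongly patterned toy data
```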
Internal Consistency Reliability
• High reliability means that the questions on the test tended to hang together: students who answered a given question correctly were more likely to answer other questions correctly
• Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly
Reliability Coefficient Interpretation
General guidelines for homogeneous tests:
• .80 and above – Very good reliability
• .70 to .80 – Good reliability; a few test items may need to be improved
• .50 to .70 – Somewhat low; several items will likely need improvement (unless the test is short, 15 or fewer items)
• .50 and below – Questionable reliability; test likely needs revision
Item difficulty (1)
• Proportion of students that got the item correct (ranges from 0% to 100%)
• Helps evaluate whether an item is suited to the level of examinee being tested
• Very easy or very hard items cannot adequately discriminate between student performance levels
• Spread of student scores is maximized with items of moderate difficulty
Item difficulty (2)
• Moderate item difficulty is the point halfway between a perfect score and a chance score.

Item format     Moderate difficulty level
4-option MC     63%
5-option MC     60%
10-option MC    55%
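Since the chance score on an item with c options is 1/c, the moderate difficulty point is (1 + 1/c) / 2. A small sketch reproducing the table (note the 4-option value is 62.5%, which the table rounds to 63%):

```python
# Moderate difficulty = midpoint between a perfect score (100%)
# and the chance score (1 / number of options).
def moderate_difficulty(n_options: int) -> float:
    chance = 1.0 / n_options
    return (1.0 + chance) / 2

for n in (4, 5, 10):
    print(f"{n}-option MC: {moderate_difficulty(n) * 100:.1f}%")
# 4-option MC: 62.5%   5-option MC: 60.0%   10-option MC: 55.0%
```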
Item discrimination (1)
• How well does the item separate those who know the material from those who do not?
• In LXR, measured by the point-biserial (rpb) correlation (ranges from -1 to 1)
• rpb is the correlation between item and exam performance
Item discrimination (2)
• A positive rpb means that those scoring higher on the exam were more likely to answer the item correctly (better discrimination)
• A negative rpb means that high scorers on the exam answered the item wrong more frequently than low scorers (poor discrimination)
• A desirable rpb correlation is +0.20 or higher (a computational sketch follows)
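The point-biserial is simply Pearson’s r computed between the 0/1 item scores and the total exam scores, so it is straightforward to compute directly. A minimal sketch with invented data (some programs correlate the item against the total score with that item removed; whether LXR does is not stated here):

```python
import numpy as np

def point_biserial(item: np.ndarray, totals: np.ndarray) -> float:
    """Correlation between 0/1 item scores and total exam scores."""
    return np.corrcoef(item, totals)[0, 1]

# Toy data: 6 students; the item is answered correctly mostly by high scorers.
item = np.array([1, 1, 1, 0, 1, 0])
totals = np.array([92, 88, 75, 70, 66, 50])
print(f"rpb = {point_biserial(item, totals):+.2f}")  # positive, as desired
```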
Evaluation of Distractors
• Distractors are designed to fool those who do not know the material; those who do not know the answer guess among the choices
• Distractors should be equally popular (# expected = # answered item wrong / # of distractors), as in the sketch below
• Distractors ideally have a low or negative rpb
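The equal-share expectation in the second bullet is mechanical to check. A minimal sketch (function name and data are illustrative, not LXR output) that counts option choices and computes the expected count per distractor:

```python
def distractor_counts(responses, key, options="ABCDE"):
    """Observed count per option, plus the equal-share expectation
    (# answered wrong / # of distractors) for the distractors."""
    n_wrong = sum(r != key for r in responses)
    expected = n_wrong / (len(options) - 1)
    counts = {opt: responses.count(opt) for opt in options}
    return counts, expected

# 30 simulated answer choices; the key is "C".
responses = list("CBCCBCDCCC" * 3)
counts, expected = distractor_counts(responses, key="C")
print(counts)                                      # {'A': 0, 'B': 6, 'C': 21, 'D': 3, 'E': 0}
print(f"expected per distractor: {expected:.1f}")  # 2.2 -> B is overused, A & E unused
```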
LXR Example 1 (* correct answer)

                         A*      B      C      D      E
N                        86      0      0      1      0
%                       99%     0%     0%     1%     0%
Avg % Correct on Exam 85.3%     0%     0%  82.0%     0%
rpb                    +.06    ---    ---   -.06    ---

Very easy item; would probably review the alternatives to make sure they are not ambiguous and do not provide clues that they are wrong.
LXR Example 2 (* correct answer)

                          A      B     C*      D      E
N                         0     21     65      2      0
%                        0%    24%    74%     2%     0%
Avg % Correct on Exam    0%  80.7%  87.2%  78.7%     0%
rpb                     ---   -.33   +.36   -.13    ---

Three of the alternatives are not functioning well; would review them.
LXR Example 3 (* correct answer)

                          A      B     C*      D      E
N                         3      1     15      5     66
%                        3%     1%    17%     6%    76%
Avg % Correct on Exam 83.0%  80.0%  83.4%  82.2%  86.8%
rpb                    -.07   -.09   -.15   -.12   +.23

Probably a miskeyed item; the correct answer is likely option E.
LXR Example 4 (* correct answer)

                          A     B*      C      D      E
N                        11     43      3     22      8
%                       13%    49%     3%    25%     9%
Avg % Correct on Exam 81.5%  87.4%  82.3%  84.5%  82.4%
rpb                    -.24   +.35   -.09   -.08   -.15

Relatively hard item with good discrimination. Would review alternatives C & D to see why they attract relatively low & high numbers of students.
LXR Example 5 (* correct answer)

                          A     B*      C      D      E
N                         3     60      1      5     18
%                        3%    69%     1%     6%    21%
Avg % Correct on Exam 83.0%  85.3%  80.0%  82.2%  86.8%
rpb                    -.07  +.002   -.09   -.12   +.13

Poor discrimination for correct choice “B”; choice “E” actually does a better job discriminating. Would review the item for proper keying, ambiguous wording, proper wording of alternatives, etc. This item needs revision.
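A report like the LXR printouts above can be rebuilt from raw responses. A sketch, assuming a list of each student’s chosen option plus their total exam score (illustrative names and data, not actual LXR output):

```python
import numpy as np

def option_report(choices, totals, key, options="ABCDE"):
    """Per option: N, %, mean exam score of its choosers, and rpb."""
    choices = np.array(choices)
    totals = np.array(totals, dtype=float)
    for opt in options:
        picked = (choices == opt)
        n = int(picked.sum())
        avg = totals[picked].mean() if n else 0.0
        if 0 < n < len(choices):  # rpb is undefined if everyone or no one picks it
            rpb = f"{np.corrcoef(picked.astype(float), totals)[0, 1]:+.2f}"
        else:
            rpb = "---"
        mark = "*" if opt == key else " "
        print(f"{opt}{mark}  N={n:2d}  {n / len(choices):4.0%}  "
              f"avg={avg:5.1f}  rpb={rpb}")

# Toy data: 8 students.
option_report(list("CBCCBCDC"), [88, 70, 92, 85, 66, 90, 60, 80], key="C")
```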
Resources
• Constructing Written Test Questions for the Basic and Clinical Sciences (www.nbme.org)
• How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty (Brigham Young University: testing.byu.edu/info/handbooks/betteritems.pdf)
Thank you for your time
Questions ???