Designing a Classroom Test

Anthony Paolo, PhD
Director of Assessment & Evaluation, Office of Medical Education
& Psychometrician for CTC Teaching & Learning Technologies

September 2008
Content
• Purpose of classroom test
• Test blueprint & specifications
• Item writing
• Assembling the test
• Item analysis
Purpose of Classroom Test
• Establish basis for assigning grades
• Determine how well each student has achieved course objectives
• Diagnose student problems
• Identify areas where instruction needs improvement
• Motivate students to study
• Communicate what material is important
Test Blueprint
• To ensure the test assesses what you want to measure
• To ensure the test assesses the level or depth of learning you want to measure
Bloom’s Revised Cognitive Taxonomy
• Remembering & Understanding
  – Remembering: Retrieving, recognizing, recalling relevant knowledge.
  – Understanding: Constructing meaning from information through interpreting, classifying, summarizing, inferring, explaining.
  • ITEM TYPES: MC, T/F, Matching, Short Answer
• Applying & Analyzing
  – Applying: Implementing a procedure or process.
  – Analyzing: Breaking material into constituent parts, determining how the parts relate to one another and to an overall structure or purpose through differentiating, organizing, and attributing.
  • ITEM TYPES: MC, Short Answer, Problems, Essay
• Evaluating & Creating
  – Evaluating: Making judgments based on criteria & standards through checking and critiquing.
  – Creating: Putting elements together to form a coherent or functional whole; reorganizing elements into a new pattern or structure through generating, planning, or producing.
  • ITEM TYPES: MC, Essay
Test Blueprint

Learning Level (number of test items)

Content/Objective   Knows facts (Recall)   Understanding   Applies Principles (Application)   Total
Krebs Cycle                    3                  5                       2                     10
Aquaporins                     2                  2                       6                     10
Cell Types                     5                  0                       0                      5
Total                         10                  7                       8                     25
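Treating the blueprint as data makes it easy to sanity-check. Below is a minimal Python sketch (the dictionary layout and variable names are illustrative, not from the talk) that verifies the row and column totals against the planned test length:

```python
# Blueprint as {topic: {learning level: number of items}},
# mirroring the table above.
blueprint = {
    "Krebs Cycle": {"Recall": 3, "Understanding": 5, "Application": 2},
    "Aquaporins":  {"Recall": 2, "Understanding": 2, "Application": 6},
    "Cell Types":  {"Recall": 5, "Understanding": 0, "Application": 0},
}
planned_total = 25

# Items per topic (row totals) and per learning level (column totals).
row_totals = {topic: sum(levels.values()) for topic, levels in blueprint.items()}
col_totals = {}
for levels in blueprint.values():
    for level, n in levels.items():
        col_totals[level] = col_totals.get(level, 0) + n

assert sum(row_totals.values()) == planned_total
print(row_totals)  # {'Krebs Cycle': 10, 'Aquaporins': 10, 'Cell Types': 5}
print(col_totals)  # {'Recall': 10, 'Understanding': 7, 'Application': 8}
```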
Test Specifications
• To ensure the test covers the content and/or objectives in the proper proportions
Test Specifications

Topics        Time spent on Topic   % of total class time   Number (%) of test items
Krebs Cycle          10 hrs                  40%                   10 (40%)
Aquaporins           10 hrs                  40%                   10 (40%)
Cell Types            5 hrs                  20%                    5 (20%)
Total                25 hrs                 100%                   25 (100%)
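The specifications table applies a simple proportional rule: each topic receives a share of items equal to its share of class time. A minimal sketch of that arithmetic (topic hours and the 25-item total are taken from the table above):

```python
# Allocate test items in proportion to class time spent on each topic.
hours = {"Krebs Cycle": 10, "Aquaporins": 10, "Cell Types": 5}
total_items = 25

total_hours = sum(hours.values())
allocation = {topic: round(total_items * h / total_hours)
              for topic, h in hours.items()}
print(allocation)  # {'Krebs Cycle': 10, 'Aquaporins': 10, 'Cell Types': 5}
```

With less convenient numbers, round() may leave the allocations summing to one more or one fewer than the planned total; the remainder has to be assigned by judgment.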
Item Writing – General Guidelines (1)
• Present a single clearly defined problem that is based on a significant concept rather than trivial or esoteric ideas
• Use simple, precise & unambiguous wording
• Exclude extraneous or irrelevant information
• Eliminate any systematic pattern of answers that may allow guessing correctly
Item Writing – General Guidelines (2)
• Avoid cultural, racial, ethnic & sexual bias
• Avoid presupposed knowledge that favors one group over another (“fly ball” favors those who know baseball)
• Refrain from providing unnecessary clues to the correct answer
• Avoid negatively phrased items (e.g., except, not)
• Arrange answers in alphabetical / numerical order
Item Writing – General Guidelines (3)
• Avoid “None of the above” or “All of the above” type answers
• Avoid “Both A & B” or “Neither A nor B” type answers
Item Writing – Correct Answer Is
• Longer
• More qualified or more general
• Uses familiar phraseology
• Grammatically correct for the item stem
• 1 of the 2 similar statements
• 1 of the 2 opposite statements
Item Writing – Wrong Answer Is
• Usually the first or last option
• Contains extreme words (always, never, nonsense, etc.)
• Contains unexpected language or technical terms
• Contains flippant remarks or completely unreasonable statements
Item Writing – Clue Types
• Grammatical cues
• Logical cues
• Absolute terms
• Word repeats
• Vague terms
Item Writing
• Effective test items match the desired depth of learning as directly as possible

Applying & Analyzing
• Applying: Implementing a procedure or process.
• Analyzing: Breaking material into constituent parts, determining how the parts relate to one another and to an overall structure or purpose through differentiating, organizing, and attributing.
  – ITEM TYPES: MC, Short Answer, Problems, Essay
Comparison of MC & Essay (1)
• Depth of learning
  – Essay: Can measure application and more complex outcomes. Poor for recall.
  – MC: Can be designed to measure application and more complex outcomes as well as recall.
• Item prep
  – Essay: Fewer test items, less prep time.
  – MC: Relatively large number of items, more prep time.
• Content sampling
  – Essay: Limited, few items.
  – MC: Broader content sampling.
Comparison of MC & Essay (2)
• Encouragement
  – Essay: Encourages organization, integration & effective expression of ideas.
  – MC: Encourages development of broad background of knowledge & abilities.
• Scoring
  – Essay: Time consuming, requires special measures for consistent results.
  – MC: Easy to score with consistent results.
Item Writing – Application
• MC application-of-knowledge items tend to have long vignettes that require decisions.
• Case et al. at the NBME investigated the impact of increasing levels of interpretation, analysis and synthesis required to answer a question on item performance (Academic Medicine, 1996;71:528-530).
Preparing & Assembling the Test
• Provide general directions
  – Time allowed (allow enough time to complete test)
  – How items are scored
  – How to record answers
  – How to record name/ID
• Arrange items systematically
• Provide adequate space for short answer and essay responses
• Placement of easier & harder items
Interpreting test scores
• Teachers
  – High scores = good instruction
  – Low scores = poor students
• Students
  – High scores = smart, well-prepared
  – Low scores = poor teaching, bad test
Interpreting test scores
• High scores: too easy, only measured simple educational objectives, biased scoring, cheating, unintentional clues to right answers
• Low scores: too hard, tricky questions, content not covered in class, grader bias, insufficient time to complete test
Item Analysis
• Main purpose of item analysis is to improve the test
• Analyze items to identify:
  – Potential mistakes in scoring
  – Ambiguous/tricky items
  – Alternatives that do not work well
  – Problems with time limits
Reliability
• The reliability of a test refers to the extent to which it is likely to produce consistent results
• Common approaches: Test-Retest, Split-Half, Internal consistency
• Reliability coefficients range from 0 (no reliability) to 1 (perfect reliability)
• Internal consistency is usually measured by Kuder-Richardson 20 (KR-20) or Cronbach’s coefficient alpha (see the sketch below)
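For a dichotomously scored test, KR-20 is (k / (k - 1)) * (1 - Σ p·q / σ²), where k is the number of items, p is the proportion correct on each item (q = 1 - p), and σ² is the variance of total scores. A minimal sketch, assuming a students-by-items matrix of 0/1 scores (toy data, not from the talk):

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """KR-20 internal consistency for a students x items 0/1 matrix."""
    k = scores.shape[1]                          # number of items
    p = scores.mean(axis=0)                      # proportion correct per item
    item_var = (p * (1 - p)).sum()               # sum of item variances (p * q)
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Toy data: 5 students x 4 items (1 = correct, 0 = wrong).
X = np.array([[1, 1, 1, 1],
              [1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 0]])
print(round(kr20(X), 2))  # 0.91 for this strongly patterned toy data
```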
Internal Consistency Reliability
• High reliability means that the questions on the test tended to hang together: students who answered a given question correctly were more likely to answer other questions correctly
• Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly
Reliability Coefficient Interpretation
General guidelines for homogeneous tests:
• .80 and above – Very good reliability
• .70 to .80 – Good reliability; a few test items may need to be improved
• .50 to .70 – Somewhat low; several items will likely need improvement (unless the test is short, 15 or fewer items)
• .50 and below – Questionable reliability; test likely needs revision
Item difficulty (1)
• Proportion of students that got the item correct (ranges from 0% to 100%)
• Helps evaluate whether an item is suited to the level of examinee being tested
• Very easy or very hard items cannot adequately discriminate between student performance levels
• Spread of student scores is maximized with items of moderate difficulty
Item difficulty (2)
• Moderate item difficulty is the point halfway between a perfect score and a chance score.

Item format     Moderate difficulty level
4-option MC     63%
5-option MC     60%
10-option MC    55%
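Since the chance score on an item with c options is 1/c, the moderate difficulty point is (1 + 1/c) / 2. A small sketch reproducing the table (note the 4-option value is 62.5%, which the table rounds to 63%):

```python
# Moderate difficulty = midpoint between a perfect score (100%)
# and the chance score (1 / number of options).
def moderate_difficulty(n_options: int) -> float:
    chance = 1.0 / n_options
    return (1.0 + chance) / 2

for n in (4, 5, 10):
    print(f"{n}-option MC: {moderate_difficulty(n) * 100:.1f}%")
# 4-option MC: 62.5%   5-option MC: 60.0%   10-option MC: 55.0%
```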
Item discrimination (1)
• How well does the item separate those who know the material from those who do not?
• In LXR, measured by the point-biserial (rpb) correlation (ranges from -1 to 1)
• rpb is the correlation between item and exam performance
Item discrimination (2)
• A positive rpb means that those scoring higher on the exam were more likely to answer the item correctly (better discrimination)
• A negative rpb means that high scorers on the exam answered the item wrong more frequently than low scorers (poor discrimination)
• A desirable rpb correlation is +0.20 or higher (a computational sketch follows)
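The point-biserial is simply Pearson’s r computed between the 0/1 item scores and the total exam scores, so it is straightforward to compute directly. A minimal sketch with invented data (some programs correlate the item against the total score with that item removed; whether LXR does is not stated here):

```python
import numpy as np

def point_biserial(item: np.ndarray, totals: np.ndarray) -> float:
    """Correlation between 0/1 item scores and total exam scores."""
    return np.corrcoef(item, totals)[0, 1]

# Toy data: 6 students; the item is answered correctly mostly by high scorers.
item = np.array([1, 1, 1, 0, 1, 0])
totals = np.array([92, 88, 75, 70, 66, 50])
print(f"rpb = {point_biserial(item, totals):+.2f}")  # positive, as desired
```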
Evaluation of Distractors
• Distractors are designed to fool those who do not know the material; those who do not know the answer guess among the choices
• Distractors should be equally popular (# expected = # answered item wrong / # of distractors), as in the sketch below
• Distractors ideally have a low or negative rpb
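The equal-share expectation in the second bullet is mechanical to check. A minimal sketch (function name and data are illustrative, not LXR output) that counts option choices and computes the expected count per distractor:

```python
def distractor_counts(responses, key, options="ABCDE"):
    """Observed count per option, plus the equal-share expectation
    (# answered wrong / # of distractors) for the distractors."""
    n_wrong = sum(r != key for r in responses)
    expected = n_wrong / (len(options) - 1)
    counts = {opt: responses.count(opt) for opt in options}
    return counts, expected

# 30 simulated answer choices; the key is "C".
responses = list("CBCCBCDCCC" * 3)
counts, expected = distractor_counts(responses, key="C")
print(counts)                                      # {'A': 0, 'B': 6, 'C': 21, 'D': 3, 'E': 0}
print(f"expected per distractor: {expected:.1f}")  # 2.2 -> B is overused, A & E unused
```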
LXR Example 1 (* correct answer)

                         A*      B      C      D      E
N                        86      0      0      1      0
%                       99%     0%     0%     1%     0%
Avg % Correct on Exam 85.3%     0%     0%  82.0%     0%
rpb                    +.06    ---    ---   -.06    ---

Very easy item; would probably review the alternatives to make sure they are not ambiguous and do not provide clues that they are wrong.
LXR Example 2 (* correct answer)

                          A      B     C*      D      E
N                         0     21     65      2      0
%                        0%    24%    74%     2%     0%
Avg % Correct on Exam    0%  80.7%  87.2%  78.7%     0%
rpb                     ---   -.33   +.36   -.13    ---

Three of the alternatives are not functioning well; would review them.
LXR Example 3 (* correct answer)

                          A      B     C*      D      E
N                         3      1     15      5     66
%                        3%     1%    17%     6%    76%
Avg % Correct on Exam 83.0%  80.0%  83.4%  82.2%  86.8%
rpb                    -.07   -.09   -.15   -.12   +.23

Probably a miskeyed item; the correct answer is likely option E.
LXR Example 4 (* correct answer)

                          A     B*      C      D      E
N                        11     43      3     22      8
%                       13%    49%     3%    25%     9%
Avg % Correct on Exam 81.5%  87.4%  82.3%  84.5%  82.4%
rpb                    -.24   +.35   -.09   -.08   -.15

Relatively hard item with good discrimination. Would review alternatives C & D to see why they attract relatively low & high numbers of students.
LXR Example 5 (* correct answer)

                          A     B*      C      D      E
N                         3     60      1      5     18
%                        3%    69%     1%     6%    21%
Avg % Correct on Exam 83.0%  85.3%  80.0%  82.2%  86.8%
rpb                    -.07  +.002   -.09   -.12   +.13

Poor discrimination for correct choice “B”; choice “E” actually does a better job discriminating. Would review the item for proper keying, ambiguous wording, proper wording of alternatives, etc. This item needs revision.
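A report like the LXR printouts above can be rebuilt from raw responses. A sketch, assuming a list of each student’s chosen option plus their total exam score (illustrative names and data, not actual LXR output):

```python
import numpy as np

def option_report(choices, totals, key, options="ABCDE"):
    """Per option: N, %, mean exam score of its choosers, and rpb."""
    choices = np.array(choices)
    totals = np.array(totals, dtype=float)
    for opt in options:
        picked = (choices == opt)
        n = int(picked.sum())
        avg = totals[picked].mean() if n else 0.0
        if 0 < n < len(choices):  # rpb is undefined if everyone or no one picks it
            rpb = f"{np.corrcoef(picked.astype(float), totals)[0, 1]:+.2f}"
        else:
            rpb = "---"
        mark = "*" if opt == key else " "
        print(f"{opt}{mark}  N={n:2d}  {n / len(choices):4.0%}  "
              f"avg={avg:5.1f}  rpb={rpb}")

# Toy data: 8 students.
option_report(list("CBCCBCDC"), [88, 70, 92, 85, 66, 90, 60, 80], key="C")
```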
Resources
• Constructing Written Test Questions for the Basic and Clinical Sciences (www.nbme.org)
• How to Prepare Better Multiple-Choice Test Items: Guidelines for University Faculty (Brigham Young University: testing.byu.edu/info/handbooks/betteritems.pdf)
Thank you for your time
Questions ???