Choose the best items: A basic psychometric toolkit for testmakers
Warren Lambert, Vanderbilt Kennedy Center
February 2007
Examples of Recent Test Development by KC Investigators
Peabody
Two different tests of school-based reading ability
A test of school-based math skills
Very early signs of autism spectrum in infants
A battery of 10 new tests for tracking mental health treatment of children
VUMC
Somatizing in children with recurrent abdominal pain
Survey of attending MD satisfaction with a department in hospital
Psychological rigidity in children
Goal of Today’s Session
Provide tools for people making their first index or test.
What Is a “Test”?
Could be a questionnaire
A set of items in a structured interview
Signs & symptoms of something
Often a “fuzzy” construct with numerous imperfect indicators, e.g. Beck Depression Inventory, SF-36, CBCL
Tests gain reliability by combining imperfect items into a total score. The sum of items will be more reliable than any single item.
A test is a set of items that produces a total score
How to Identify the Best Items
A toolkit, not an analytic plan
Flag weaker items to drop or revise
Identify the weaker items
Relative, not absolute criteria
Classical test theory: enough for most medical research
Floors or ceilings restrict variance
To increase Cronbach’s alpha, avoid low item-total correlations
Guesstimate test length with Spearman-Brown formula
Factor analysis (exploratory and confirmatory)
Are there items that don’t fit the construct?
Avoid items that do not load on the main factor
See how well a confirmatory model fits
Rasch modeling
Pick items that fit a carefully considered measurement model
Consider item difficulties more deeply
Pick items that suit the intended task
Informal
Formal
Psychometrics vs Statistics
Statistics:
Find a statistical model that fits your data
Psychometric test construction:
Find data that fits your statistical model
Choose sound measurement models and pick items that fit by dropping weaker items.
Classical Test Theory (CTT)
Basic description of items
Can be done with SAS, SPSS, Stata, S-Plus, R, etc.
Do this routinely with scales old and new
Informal test development, e.g., a one-shot ad hoc index for an article
Not enough for tests that will be widely used in many settings
Note Floors or Ceilings
The “Too Short” IQ Test (TS-IQ)
A very low (or very high) mean and a shrunken SD and variance all indicate floors or ceilings, but extreme kurtosis is the easiest flag to spot.
The “Too Short” IQ Test data set, with SAS and SPSS code, is available for download: http://kc.vanderbilt.edu/quant/Seminar/schedule.htm
Hard, Medium, & Easy Items: #1, #6, #10
[Figure: Wrong/Right response distributions (proportions 0.0 to 1.0) for hard item #1, medium item #6, and easy item #10; kurtosis 11, -2, and 3, respectively. The hard item piles up at “Wrong” (a floor); the easy item piles up at “Right” (a ceiling).]
Measuring an entire population requires a range of item difficulties. If everyone has the same score on an item, the item gives no information.
Use Excel Conditional Formatting to Flag Problems
Retain Flagged Estimates of Quality
“Too Short IQ Test” (TS-IQ)
Variable Mean Kurtosis
Item01 0.06 11.16
Item02 0.22 -0.11
Item03 0.35 -1.61
Item04 0.39 -1.82
Item05 0.45 -1.96
Item06 0.49 -2.01
Item07 0.54 -1.99
Item08 0.58 -1.91
Item09 0.77 -0.29
Item10 0.86 2.60
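As a rough sketch (not from the slides), the kurtosis screen above can be done in a few lines of plain Python; the item data and the “flag if large” reading here are made up for illustration:

```python
def excess_kurtosis(xs):
    """Sample excess kurtosis (a normal distribution gives 0)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n  # second central moment
    m4 = sum((x - m) ** 4 for x in xs) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0

# Hypothetical binary items: a floor item almost everyone misses,
# and a medium item about half the sample gets right.
floor_item = [0] * 95 + [1] * 5
medium_item = [0] * 50 + [1] * 50

print(round(excess_kurtosis(floor_item), 2))   # large positive value flags the floor
print(round(excess_kurtosis(medium_item), 2))  # about -2 for a balanced binary item
```

This mirrors the TS-IQ table: balanced items sit near kurtosis -2, while items piled at a floor or ceiling spike upward.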
Item-Total Correlations
How can you add unrelated things into a single total?
If an item is uncorrelated with other items, it doesn’t contribute to the internal-consistency reliability of the total score
Software packages like SAS, SPSS, etc. will do item-total correlations very easily
Good check to use routinely
Biological Age Index (Frailty)
Negative Item-Total Correlations Are Bad
Forgot to “flip” items on left
r(Item-Total, unflipped)  High Is  Label                           r(Item-Total, flipped)
 0.57                     Good     Feet Walked In Six Minutes      0.68
 0.42                     Good     Rank For Variable Foot          0.59
 0.30                     Good     Times Weight Lifting            0.54
-0.40                     Bad      Seconds Trail B                 0.44
 0.42                     Good     Standing Forward Bend           0.41
 0.38                     Good     Tinetti Balance Score           0.38
-0.30                     Bad      GDS Depression (High=Sad)       0.38
 0.20                     Good     Sum of 3 Exercise Measures      0.35
-0.20                     Bad      Charlson Comorbidity Index      0.32
-0.12                     Bad      Mini Mental State Examination   0.15
-0.09                     Bad      Body Mass Index                 0.02
Make sure all items are high-is-good or high-is-bad
Goffaux, J., G. C. Friesinger, Lambert, E.W. et al. (2005). "Biological age--a concept whose time has come: a preliminary study." South Med J 98(10): 985-93.
“TS-IQ”: Low Item-Total r’s Are Bad
SPSS Reliability or SAS PROC CORR ALPHA
“Too Short” Item-Total Correlations
• Items with nothing in common would not have a reliable total score
• Cronbach’s alpha = internal-consistency reliability
• Reliability increases with high item-total correlations
Item     Mean   Kurtosis   r(Item-Total)
Item01 0.06 11.16 0.30
Item02 0.22 -0.11 0.53
Item03 0.35 -1.61 0.53
Item04 0.39 -1.82 0.43
Item05 0.45 -1.96 0.53
Item06 0.49 -2.01 0.60
Item07 0.54 -1.99 0.59
Item08 0.58 -1.91 0.36
Item09 0.77 -0.29 0.49
Item10 0.86 2.60 0.39
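In practice SPSS Reliability or SAS PROC CORR ALPHA produces these numbers; as an illustration of what they compute, here is a minimal pure-Python sketch with made-up data (two parallel items plus one reverse-keyed item):

```python
from statistics import mean, pvariance

def cronbach_alpha(items):
    """items: one list of scores per item, all over the same respondents."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    return k / (k - 1) * (1 - sum(pvariance(it) for it in items) / pvariance(totals))

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def corrected_item_total(items, i):
    """Correlate item i with the total of the *other* items (corrected r)."""
    totals = [sum(person) for person in zip(*items)]
    rest = [t - x for t, x in zip(totals, items[i])]
    return pearson_r(items[i], rest)

# Hypothetical data: items 0 and 1 agree; item 2 was forgotten to be "flipped".
items = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
print(round(corrected_item_total(items, 2), 2))  # negative: flag (or flip) this item
```

A negative corrected item-total correlation is exactly the warning sign in the frailty example: either the item is reverse-keyed, or it does not belong in the total.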
How Many Items?
Spearman-Brown’s Predicted Reliability = F(N Items)
[Figure: predicted reliability (y-axis, 0.0 to 1.0) plotted against number of items (x-axis, 5 to 35); reliability rises with test length.]
Classical Test Theory: Reliability increases with the number of items
Put the S-B formula into Excel to see approximately how many items you need for a desired reliability under CTT.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 171-195.
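The Spearman-Brown prophecy formula the slide puts into Excel is small enough to sketch directly; the example numbers below are illustrative, not from the slides:

```python
import math

def spearman_brown(r, k):
    """Predicted reliability when test length is multiplied by factor k."""
    return k * r / (1 + (k - 1) * r)

def items_needed(r, target, n_items):
    """Length multiplier that reaches `target` reliability, rounded up to whole items."""
    k = target * (1 - r) / (r * (1 - target))
    return math.ceil(n_items * k)

print(round(spearman_brown(0.60, 2.0), 2))  # 0.75: doubling a .60-reliable test
print(items_needed(0.60, 0.80, 10))         # 27 items to reach about .80
```

This is the CTT point of the figure above: reliability grows with test length, but with diminishing returns, so the inverse formula tells you how many parallel items a target reliability costs.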
How Much is Enough?
• For local use of an ad hoc research index, CTT may suffice
• Formal tests (available for general use) require more thorough psychometric analysis
• Factor analysis and Item Response Theory modeling
Factor Analysis (FA)
Beginning formal test development
Goal is to make sure the test’s theoretical foundations agree with the test data
We want to produce one or more single-factor tests
Use EFA (exploratory factor analysis) and
CFA (confirmatory factor analysis)
Scree Plot of “TS-IQ”
• Run a principal components analysis with SAS, SPSS, etc.
• “Scree” plot of eigenvalues
• Cattell’s metaphor: a mountain rising above useless rubble
• Is there more than one big component?
• Hard to get multiple factors (subtests) from the “Too Short” IQ test
• Kaiser criterion (minimum eigenvalue > 1) is extremely liberal and makes unstable factors
[Figure: scree plot of eigenvalues (0 to 5) against component number (1 to 10); one large first component towering above the rubble.]
Formal Test Construction
VUMC Pediatric Researcher’s Three Samples
1. N = 181 children rating understandability of items
2. N = 513 Psychometric sample 1
3. Psychometric sample 2, N = 675
2a. Random 50% N=346 exploratory sample (CTT, EFA)
2b. Random 50% confirmatory sample (CTT, CFA)
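The random 50/50 split into exploratory and confirmatory halves can be sketched in a few lines; the IDs, seed, and helper name here are hypothetical:

```python
import random

def split_half(ids, seed=2007):
    """Randomly split respondent IDs into exploratory and confirmatory halves."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

explore, confirm = split_half(range(675))
print(len(explore), len(confirm))  # 337 338
```

Fitting EFA on one half and CFA on the other avoids testing a model on the same data that suggested it.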
Confirmatory Factor Analysis
See how well (ha! how badly!) the data fit a theory-driven model: “factorial validity”
Theory: TS-IQ measures a single dimension of intelligence.
Run a measurement model
Look at fit indices
Very popular in psychology, rarely done in nonpsychiatric medicine (exception: SF-36 has extensive psychometric analysis)
“Too Short IQ” SAS CFA of single-factor measurement model
RMSEA < .05, CFI > 0.95 or 0.96 (very high standards of unidimensionality)
Warning: So far, most VU tests early in their development haven’t met the high standards for measurement model fit.
Run Rasch or IRT (Item Response Theory)
Rasch: One parameter logistic model
Good for practical test development (converges)
E.g. Winsteps ($100 or $200)
Item Response Theory (IRT)
1-2-3 parameter models
Good for research
Need large samples
E.g. Parscale, Bilog-MG, Multilog ($100 VU site license)
P = probability of getting item i “right”
Theta = person’s ability
b = the item’s difficulty, on the same scale as theta
In the Rasch model, P = exp(theta - b) / (1 + exp(theta - b))
Rasch Model
• “Measure score” for person and item in same units
• If you’re better than the item, p (right) > 50%
• One-parameter (1PL) logistic model
As (Person – Item) increases, prob (right) increases in logistic model.
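A minimal sketch of that logistic relationship (ability and difficulty values are illustrative):

```python
import math

def rasch_p(theta, b):
    """Probability of a correct answer under the one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(rasch_p(0.0, 0.0))        # 0.5: ability equals item difficulty
print(rasch_p(1.0, 0.0) > 0.5)  # True: person is better than the item
```

Because person and item sit on the same logit scale, the sign of (theta - b) alone determines whether p(right) is above or below 50%.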
Rasch Model
Items spread over a range of difficulties
http://en.wikipedia.org/wiki/Rasch_model
Easy items
Hard items
WINSTEPS
One-parameter Rasch program (see http://www.winsteps.com)
$200 ($99 on summer sale)
“TS-IQ”: Item Information Spread Across the Whole Range
Easy items, like #10, are most informative about low-scoring individuals.
Hard items, like #1, are most informative about high-scoring individuals.
This test’s items spread out to describe the whole range of IQs.
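The “where is an item informative?” idea has a simple closed form under the Rasch model: item information is P(1 - P), peaking where ability equals difficulty. A sketch with illustrative values:

```python
import math

def rasch_information(theta, b):
    """Fisher information of a Rasch item: I = P(1 - P), maximal when theta = b."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

# A hard item (b = 2) tells us most about high-ability people:
print(rasch_information(2.0, 2.0))   # 0.25, the maximum
print(rasch_information(-2.0, 2.0) < rasch_information(2.0, 2.0))  # True
```

This is why a screening test can deliberately pile its items near a clinical cutpoint instead of spreading them, as the next slide shows.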
Persons & Items on One Scale
Rasch model measures each item and each person on the same scale
Concentrate your items where they are needed
Measure everyone
Measure high clinical cases most efficiently
TS-IQ measures across a wide range
VUMC Clinical Test Focuses on Cutpoint, Unlike the TS-IQ
• School sample
• High is bad (sicker)
• Clinical screens focus on sick people
• Classify: treat yes-no
• Job is to be maximally informative at the cutpoint
• This test invests its items in the severe range
Putting It All Together
TS-IQ’s Items and Total
Putting It All Together
VUMC Pediatrics
Items go 0-4
Many items near the floor (scores ≤ 1)
The lowest few have excessive kurtosis
However, many item-total r’s and Rasch fit statistics are OK
Test maker can shorten this with considerable latitude, e.g. with content analysis.
Putting It All Together
Test has one odd item that measures something else.
Drop or revise that item.
How to Identify the Best Items
A toolkit, not an analytic plan
Flag weaker items to drop or revise
Identify the weaker items
Relative, not absolute criteria
Classical test theory: enough for most medical research
Floors or ceilings restrict variance
To increase Cronbach’s alpha, avoid low item-total correlations
Guesstimate test length with Spearman-Brown formula
Factor analysis (exploratory and confirmatory)
Are there items that don’t fit the construct?
Avoid items that do not load on the main factor
See how well a confirmatory model fits
Rasch modeling
Pick items that fit a carefully considered measurement model
Consider item difficulties more deeply
Pick items that suit the intended task
Informal
Formal