1
Choosing Reliable Items: An objective multi-model approach to applied psychometrics
Warren Lambert
Peabody College & Vanderbilt Kennedy Center
Measurement 2.0 Conference, Salt Lake City, February 2009
2
Len Bickman’s Peabody Treatment Progress Battery
To give away a suite of tools to evaluate client progress in counseling, we had to develop 17 “new” tests. This required a formal systematic approach.
http://peabody.vanderbilt.edu/Microsites/Center/Center_for_Evaluation_and_Program_Improvement_(CEPI)/The_Peabody_Treatment_Progress_Battery_(PTPB).xml
3
Statistical Approach: Imperfect Complementary Models
Netflix winners: “We found it was important to utilize a variety of models that complement the shortcomings of each other. . . .
Lessons Learned. . . the best predictive performance came from combining complementary models.”
Chasing $1,000,000: How We Won the Netflix Progress Prize
Robert Bell, Yehuda Koren, and Chris Volinsky, AT&T Labs – Research
Volume 18, No. 2, Dec. 2007
4
How to Identify Reliable Items
Classical test theory: enough for one-shot ad hoc indices
Floors or ceilings restrict variance
Look at a PCA
To increase Cronbach’s alpha, avoid low item-total correlations
Guesstimate test length with Spearman-Brown formula
Factor analysis (confirmatory if at all possible)
See how well a 1-factor confirmatory model fits
Factorial “validity,” does the factor structure fit theory?
Rasch (IRT) modeling
Pick items that fit a carefully considered measurement model
Consider item difficulties more deeply
Pick items suited to the intended task
(Scale alongside the list: informal at the top, formal at the bottom)
5
Classical Test Theory (CTT)
Tools of CTT
Basic description of items & their correlations
Cronbach’s alpha, internal-consistency reliability
Corrected item-total correlations
Principal components (PCA)
Spearman-Brown test length estimation
CTT is good to do routinely with index scores
OK for informal test development, e.g., a one-shot ad hoc index
Insufficient for tests that will be published for wide use
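The Spearman-Brown test length estimate listed among the CTT tools can be sketched in a few lines. This is only an illustration: the function names and the example reliabilities (0.70 now, 0.80 wanted) are hypothetical and do not come from the talk.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when the test is made length_factor times as long."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def length_factor_needed(reliability: float, target: float) -> float:
    """How many times longer the test must be to reach the target reliability."""
    return target * (1.0 - reliability) / (reliability * (1.0 - target))


if __name__ == "__main__":
    # Hypothetical scale with reliability 0.70 (not a value from the talk).
    print(f"Reliability if doubled in length: {spearman_brown(0.70, 2.0):.3f}")
    print(f"Length factor needed for 0.80:    {length_factor_needed(0.70, 0.80):.2f}")
```

The formula assumes the added items are as good as the existing ones, so the result is a rough estimate rather than a guarantee.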
6
Note Floors or Ceilings
The “Too Short” IQ Test (TS-IQ)
A low mean, SD, or variance can each indicate a floor or ceiling, but outrageous kurtosis is the easiest sign to see.
7
Acorn 10 item scale and 3 item index
8
Retain Flagged Estimates of Item Quality
“Too Short IQ Test” (TS-IQ)
Variable Mean Kurtosis
Item01 0.06 11.16
Item02 0.22 -0.11
Item03 0.35 -1.61
Item04 0.39 -1.82
Item05 0.45 -1.96
Item06 0.49 -2.01
Item07 0.54 -1.99
Item08 0.58 -1.91
Item09 0.77 -0.29
Item10 0.86 2.60
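Why kurtosis flags floors and ceilings so loudly can be illustrated with a small sketch. It assumes the TS-IQ items are scored 0/1; the talk does not say so, but the means between 0 and 1 suggest it. For a 0/1 item the population excess kurtosis depends only on the proportion correct p. The printed values will not match the table exactly, because the table shows sample estimates and rounded means.

```python
def bernoulli_excess_kurtosis(p: float) -> float:
    """Population excess kurtosis of a 0/1 item with proportion p scored 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    variance = p * (1.0 - p)
    return (1.0 - 6.0 * variance) / variance


if __name__ == "__main__":
    # Item means copied from the TS-IQ table (Item01, Item02, Item06, Item10).
    for item_mean in (0.06, 0.22, 0.49, 0.86):
        kurt = bernoulli_excess_kurtosis(item_mean)
        print(f"mean={item_mean:.2f}  excess kurtosis={kurt:6.2f}")
```

Items near a mean of 0.5 sit close to the minimum of -2, while items near 0 or 1 have large positive kurtosis, which is the pattern in the table.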
9
Raw Scree Plots for Comparison
Compare Result to Random Shadow
• Simple principal components (Pearson, 1901)
• 10,000 PCAs on random numbers
• Same size data set
• Half page R code
• Visually distinguish chance effects
• Falls short of a confirmatory factor analysis
Greco, L. A., Lambert, W., & Baer, R. A. (2008). Psychological inflexibility in childhood and adolescence: Development and evaluation of the Avoidance and Fusion Questionnaire for Youth. Psychological Assessment, 20(2), 93-102.
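The “random shadow” comparison can be sketched roughly as follows. The talk used about half a page of R; this is a Python/NumPy stand-in, not the author's code. The function names, the choice of normal random numbers, the 95th-percentile summary, and the simulated example data are all assumptions made for illustration.

```python
import numpy as np


def pca_eigenvalues(data: np.ndarray) -> np.ndarray:
    """Eigenvalues of the item correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.linalg.eigvalsh(corr)[::-1]


def random_shadow(n_rows: int, n_cols: int, n_sims: int = 1000,
                  percentile: float = 95.0, seed: int = 0) -> np.ndarray:
    """Percentile of eigenvalues from PCAs on random data of the same size."""
    rng = np.random.default_rng(seed)
    sims = np.empty((n_sims, n_cols))
    for i in range(n_sims):
        sims[i] = pca_eigenvalues(rng.standard_normal((n_rows, n_cols)))
    return np.percentile(sims, percentile, axis=0)


if __name__ == "__main__":
    # Hypothetical data: 300 respondents, 10 items sharing one common factor.
    rng = np.random.default_rng(1)
    factor = rng.standard_normal((300, 1))
    items = 0.7 * factor + rng.standard_normal((300, 10))

    observed = pca_eigenvalues(items)
    shadow = random_shadow(300, 10, n_sims=200)
    for rank, (obs, rnd) in enumerate(zip(observed, shadow), start=1):
        marker = "above chance" if obs > rnd else ""
        print(f"component {rank:2d}: observed={obs:5.2f}  random={rnd:5.2f}  {marker}")
```

Components whose observed eigenvalue rises above the random line are the ones unlikely to be chance effects. As the slide notes, this visual check falls short of a confirmatory factor analysis.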
10
“Too Short IQ” Spreadsheet
• Items with something in common contribute to a reliable total score
• Cronbach’s alpha internal consistency reliability
• Reliability increases with high item-total correlations
• Reliability increases with test length
Item Mean Kurtosis r(Item-Total)
Item01 0.06 11.16 0.30
Item02 0.22 -0.11 0.53
Item03 0.35 -1.61 0.53
Item04 0.39 -1.82 0.43
Item05 0.45 -1.96 0.53
Item06 0.49 -2.01 0.60
Item07 0.54 -1.99 0.59
Item08 0.58 -1.91 0.36
Item09 0.77 -0.29 0.49
Item10 0.86 2.60 0.39
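The two statistics behind this spreadsheet, Cronbach's alpha and the corrected item-total correlation, can be sketched in plain Python. The response matrix below is made up for illustration (six respondents, four 0/1 items); it is not the TS-IQ data, and the function names are hypothetical.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(rows):
    """rows holds one list of item scores per respondent."""
    k = len(rows[0])
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


def corrected_item_total(rows):
    """Correlate each item with the total of the other items."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]  # total score without item j
        result.append(pearson(item, rest))
    return result


if __name__ == "__main__":
    # Made-up responses: rows are respondents, columns are items.
    demo = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
    ]
    print(f"alpha = {cronbach_alpha(demo):.3f}")
    for j, r in enumerate(corrected_item_total(demo), start=1):
        print(f"item {j}: r(item-total) = {r:.2f}")
```

The correction removes each item from its own total, so an item is not credited for correlating with itself.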
11
How to Identify Reliable Items 2
Classical test theory: enough for one-shot ad hoc indices
Floors or ceilings restrict variance
Look at a PCA
To increase Cronbach’s alpha, avoid low item-total correlations
Guesstimate test length with Spearman-Brown formula
Factor analysis (confirmatory if at all possible)
See how well a 1-factor confirmatory model fits
Factorial “validity,” does the factor structure fit theory?
Rasch (IRT) modeling
Pick items that fit a carefully considered measurement model
Consider item difficulties more deeply
Pick items suited to the intended task
(Scale alongside the list: informal at the top, formal at the bottom)
12
Confirmatory Factor Analysis (CFA)
See how well (ha! how badly!) the data fit a theory-driven model (factorial “validity”)
Theory: TS-IQ measures g, a single dimension of intelligence.
Evaluate the fit of a single factor measurement model
CFA is popular in psychology but seldom done in non-psychiatric medicine (exception: quality-of-life indices have had extensive psychometric analysis using all current methods)
13
“Too Short IQ” SAS CFA of single-factor measurement model
RMSEA < .05, CFI > 0.95 or 0.96 (high standards of model fit)
So far, most VU tests early in development fail to meet the high standards for measurement model fit.
SAS PROC CALIS: old-fashioned but (more or less) usable
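For readers who want to see where the two fit indices come from, here is a minimal sketch that computes RMSEA and CFI from chi-square statistics, using the commonly cited formulas. The chi-square values and sample size are invented, and this is not PROC CALIS output. Some programs use N rather than N - 1 in the RMSEA denominator.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model: float, df_model: int, chi2_base: float, df_base: int) -> float:
    """Comparative fit index against the independence (baseline) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0.0 else 1.0 - d_model / d_base


if __name__ == "__main__":
    # Hypothetical 10-item, one-factor model: 55 moments - 20 parameters = 35 df.
    fit_rmsea = rmsea(chi2=70.0, df=35, n=300)
    fit_cfi = cfi(chi2_model=70.0, df_model=35, chi2_base=900.0, df_base=45)
    print(f"RMSEA = {fit_rmsea:.3f}  (standard on the slide: < .05)")
    print(f"CFI   = {fit_cfi:.3f}  (standard on the slide: > 0.95)")
```

In this invented case CFI clears the bar while RMSEA does not, the kind of mixed result an early-stage test can produce.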
14
Rasch or IRT Model
IRT, Item Response Theory
Rasch: One parameter logistic IRT model
Good for practical test development (converges)
Multi-parameter Item Response Theory (IRT)
2-3 parameter models (discrimination, guessing)
For measurement research
Software, e.g. R, MPLUS, Parscale, Bilog-MG, user-written procs
P = probability of getting item i “right”
Theta = person’s ability
b = item’s difficulty on the same scale
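The equation these symbols belong to did not survive in the transcript. In standard notation, the one-parameter (Rasch) model and its three-parameter generalization are usually written as below. The subscripts and the symbols a_i (discrimination) and c_i (guessing) follow common textbook usage and are not taken from the slide.

```latex
\[
P_i(\theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}}
\qquad \text{(Rasch, 1PL)}
\]
\[
P_i(\theta) = c_i + (1 - c_i)\,
\frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}
\qquad \text{(3PL; } c_i = 0 \text{ gives the 2PL)}
\]
```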
15
Rasch Model
• “Measure score” for person and item in same units
• If your measure = item’s measure, p(right) = 50%
• If you’re better than the item, p(right) > 50%
• One-parameter logistic model (1PLM)
As (Person - Item) increases, the probability of a correct response increases along a logistic curve.
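A small numeric sketch of these bullets, assuming person and item measures are expressed in logits. The function name and the example measures are hypothetical.

```python
import math


def rasch_probability(person: float, item: float) -> float:
    """Probability of a correct response under the one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(person - item)))


if __name__ == "__main__":
    # Hypothetical measures: one item at 0.0 logits, three persons around it.
    for person in (-2.0, 0.0, 2.0):
        p = rasch_probability(person, item=0.0)
        print(f"person={person:+.1f}  item=+0.0  p(right)={p:.2f}")
```

Only the difference between the two measures matters, which is why persons and items can share one scale.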
16
Rasch (1960/1980) model
Simple 1PLM, can use conventional total score or table lookup
Parallel logistic curves for items
Good for practical test construction (WINSTEPS)
Software in development > 20 years
IRT 2PLM, 3PLM may be better for certain kinds of measurement research
Rasch, G. (1960/1980). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: University of Chicago Press.
Rasch: Danish statistician, student of R. A. Fisher
Rasch Model: TS-IQ Items Cover a Range of Difficulties
17
“Too Short IQ” Item Information Spread Across the Whole Range
Easy items, like #10, are most informative about low-scoring individuals.
Hard items, like #1, are most informative about high-scoring individuals.
This test’s items are spread out to describe the whole range of IQs.
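Why easy items inform about low scorers and hard items about high scorers can be sketched with the Rasch item information function, which is P(1 - P) and peaks where the person's measure equals the item's difficulty. The difficulties below are invented stand-ins for an easy and a hard item, not the estimated TS-IQ values.

```python
import math


def item_information(theta: float, difficulty: float) -> float:
    """Rasch item information: P * (1 - P), largest when theta == difficulty."""
    p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
    return p * (1.0 - p)


def scale_information(theta: float, difficulties) -> float:
    """Information of the whole scale is the sum over its items."""
    return sum(item_information(theta, b) for b in difficulties)


if __name__ == "__main__":
    easy, hard = -2.0, 2.0  # hypothetical difficulties in logits
    for theta in (-2.0, 0.0, 2.0):
        print(f"theta={theta:+.1f}  easy item={item_information(theta, easy):.3f}  "
              f"hard item={item_information(theta, hard):.3f}  "
              f"both={scale_information(theta, [easy, hard]):.3f}")
```

Spreading item difficulties across the scale, as the TS-IQ does, keeps the summed information from collapsing at either end.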
18
IRT: Compare Items with People
Clinically Targeted Test (VUMC Greco)
• Items gray, people black
• School sample
• High is bad (sicker)
• Clinical screens focus on sick people
• Classify: treat yes-no
• Job is to be maximally informative at the cutpoint
• This test invests its items in severe range
Greco, L. A., Lambert, W., & Baer, R. A. (2008). Psychological inflexibility in childhood and adolescence: Development and evaluation of the Avoidance and Fusion Questionnaire for Youth. Psychological Assessment, 20(2), 93-102.
19
• Left, distribution of children (each # = 3)
• Right, distribution of items
• Centerline, measure score, theta for people and “difficulty” for items
• Self-harm item, a severe outlier
• 9 items concentrated in low-average range
• Are they concentrated near the clinical-normal threshold?
Acorn 10 item scale
20
Putting It All Together
Too Short-IQ’s Items and Total
21
Putting it all together (Walker’s CSI)
Multiple criteria converge => firm conclusion without definitive cutoffs or perfect models
Items scored 0-4
Items 1-35
Walker, Lynn S., Beck, Joy E., Garber, Judy, & Lambert, Warren. The Children’s Somatization Inventory: Psychometric Properties of the Revised Form (CSI-24) and Evidence for a Continuum of Symptom Reporting in Youth. In press, J. Pediatric Psychology.
22
Bold items raise some concern.
The self-harm item, with its low mean, shows some roughness (no fatal flaws).
Infit/outfit flags are borderline. The “good” range is now 0.7-1.3; it used to be 0.5-1.5.
A&D items are near the floor but still seem to work.
Acorn 10 item scale and 3 item index
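The infit and outfit flags mentioned here can be sketched for a single dichotomous item as follows. This is a simplified illustration, not the output of WINSTEPS or any other package. It treats the person measures and item difficulty as already known, whereas real software estimates them, and the responses are made up. Only the 0.7-1.3 range comes from the slide.

```python
import math


def rasch_probability(person: float, item: float) -> float:
    return 1.0 / (1.0 + math.exp(-(person - item)))


def item_fit(responses, persons, item):
    """Return (infit, outfit) mean squares for one 0/1 item."""
    squared_residuals = 0.0
    information = 0.0
    squared_z = 0.0
    for x, theta in zip(responses, persons):
        p = rasch_probability(theta, item)
        variance = p * (1.0 - p)
        squared_residuals += (x - p) ** 2
        information += variance
        squared_z += (x - p) ** 2 / variance
    infit = squared_residuals / information  # information-weighted
    outfit = squared_z / len(responses)      # unweighted, outlier-sensitive
    return infit, outfit


def is_flagged(mean_square: float, low: float = 0.7, high: float = 1.3) -> bool:
    """Flag values outside the 0.7-1.3 range quoted on the slide."""
    return not (low <= mean_square <= high)


if __name__ == "__main__":
    # Hypothetical person measures (logits) and responses to one item at 0.0.
    persons = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
    responses = [0, 0, 1, 0, 1, 1]
    infit, outfit = item_fit(responses, persons, item=0.0)
    print(f"infit={infit:.2f} flagged={is_flagged(infit)}")
    print(f"outfit={outfit:.2f} flagged={is_flagged(outfit)}")
```

Values near 1.0 mean the responses are about as noisy as the model expects. Outfit reacts more strongly to surprising responses from people far from the item's difficulty.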
23
• 10-item scale has excellent overall stats
• Even fits a one-factor model with fit indices good enough for Psych Assessment purists.
• 3-item scale has some problems as a reliable psychological test
• May be too short to act as a scale with a reliable sum score
• A set of 3 warning flags?
Acorn 10 item scale and 3 item index
24
25
26