
1

Choosing Reliable Items: An objective multi-model approach to applied psychometrics

Warren Lambert, Peabody College & Vanderbilt Kennedy Center

Measurement 2.0 Conference, Salt Lake City, February 2009

2


Len Bickman’s Peabody Treatment Progress Battery

To give away a suite of tools to evaluate client progress in counseling, we had to develop 17 “new” tests. This required a formal systematic approach.

http://peabody.vanderbilt.edu/Microsites/Center/Center_for_Evaluation_and_Program_Improvement_(CEPI)/The_Peabody_Treatment_Progress_Battery_(PTPB).xml

3

Statistical Approach: Imperfect Complementary Models

NETFLIX Winners, “We found it was important to utilize a variety of models that complement the shortcomings of each other. . . .

Lessons Learned. . . the best predictive performance came from combining complementary models.”

CHASING $1,000,000: HOW WE WON THE NETFLIX PROGRESS PRIZE

Robert Bell, Yehuda Koren, and Chris Volinsky, AT&T Labs – Research

VOLUME 18, NO 2, DEC. 2007

4

How to Identify Reliable Items

(Methods are listed from informal to formal.)

Classical test theory: enough for one-shot ad hoc indices
• Floors or ceilings restrict variance
• Look at a PCA
• To increase Cronbach’s alpha, avoid low item-total correlations
• Guesstimate test length with the Spearman-Brown formula

Factor analysis (confirmatory if at all possible)
• See how well a 1-factor confirmatory model fits
• Factorial “validity”: does the factor structure fit theory?

Rasch (IRT) modeling
• Pick items that fit a carefully considered measurement model
• Consider item difficulties more deeply
• Pick items suited to the intended task

5

Classical Test Theory (CTT)

Tools of CTT
• Basic description of items & their correlations
• Cronbach’s alpha, internal-consistency reliability
• Corrected item-total correlations
• Principal components (PCA)
• Spearman-Brown test length estimation

CTT is good to do routinely with index scores
• OK for informal test development, e.g., a one-shot ad hoc index
• Insufficient for tests that will be published for wide use
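A minimal Python sketch of two of the CTT tools listed above: Cronbach's alpha and corrected item-total correlations. The data are simulated for illustration only, and the function names `cronbach_alpha` and `corrected_item_total` are hypothetical, not taken from the talk.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a persons-by-items score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the sum score
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)

def corrected_item_total(items):
    """Correlate each item with the sum of the OTHER items."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])

# Simulated 0/1 responses (assumption: 500 people, 10 items, one trait).
rng = np.random.default_rng(0)
ability = rng.normal(size=500)
difficulty = np.linspace(-2, 2, 10)
prob = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
scores = (rng.random(prob.shape) < prob).astype(int)

alpha = cronbach_alpha(scores)
r_it = corrected_item_total(scores)
print(f"alpha = {alpha:.2f}")
for j, r in enumerate(r_it, start=1):
    print(f"Item{j:02d}  r(item-total) = {r:.2f}")
```

Removing the item from the total before correlating keeps an item from inflating its own item-total correlation, which matters most on short scales.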

6

Note Floors or Ceilings: The “Too Short” IQ Test (TS-IQ)

An extreme mean and a low SD or variance all indicate floors or ceilings, but outrageous kurtosis is the easiest sign to see.

7

Acorn 10 item scale and 3 item index

8

Retain Flagged Estimates of Item Quality: “Too Short IQ Test” (TS-IQ)

Variable   Mean   Kurtosis
Item01     0.06     11.16
Item02     0.22     -0.11
Item03     0.35     -1.61
Item04     0.39     -1.82
Item05     0.45     -1.96
Item06     0.49     -2.01
Item07     0.54     -1.99
Item08     0.58     -1.91
Item09     0.77     -0.29
Item10     0.86      2.60
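For 0/1 items, kurtosis is a function of the item mean alone, which is why the items near the floor and ceiling stand out in the table above. A short sketch, assuming the items are scored 0/1 and using the population formula for a Bernoulli variable; sample estimators (and rounding of the means) give somewhat different values, as the table shows. The cutoff of 2.0 is an illustrative assumption, not a rule from the talk.

```python
def bernoulli_excess_kurtosis(p):
    """Population excess kurtosis of a 0/1 item with proportion correct p."""
    q = 1.0 - p
    return (1.0 - 6.0 * p * q) / (p * q)

# Item means copied from the TS-IQ table.
means = {
    "Item01": 0.06, "Item02": 0.22, "Item03": 0.35, "Item04": 0.39,
    "Item05": 0.45, "Item06": 0.49, "Item07": 0.54, "Item08": 0.58,
    "Item09": 0.77, "Item10": 0.86,
}

CUTOFF = 2.0  # hypothetical flagging threshold, for illustration only

flagged = []
for name, p in means.items():
    k = bernoulli_excess_kurtosis(p)
    if k > CUTOFF:
        flagged.append(name)
    print(f"{name}  mean={p:.2f}  kurtosis={k:6.2f}")

print("Flagged:", flagged)
```

The minimum possible value is -2, reached at a mean of 0.50, so only items near the floor or ceiling can produce large positive kurtosis.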

9

Raw Scree Plots for Comparison: Compare Result to Random Shadow

• Simple principal components (Pearson, 1901)
• 10,000 PCAs on random numbers
• Same size data set
• Half page R code
• Visually distinguish chance effects
• Falls short of a confirmatory factor analysis

Greco, L. A., Lambert, W., & Baer, R. A. (2008). Psychological inflexibility in childhood and adolescence: Development and evaluation of the Avoidance and Fusion Questionnaire for Youth. Psychological Assessment, 20(2), 93-102.
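The random-shadow comparison can be sketched as follows. The slide reports a half page of R code; this is a Python stand-in under stated assumptions, not that code. It uses 200 simulations rather than 10,000 to stay fast, simulated one-factor data, and hypothetical function names.

```python
import numpy as np

def pca_eigenvalues(data):
    """Eigenvalues of the correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def random_shadow(n_persons, n_items, n_sims=200, seed=0):
    """95th percentile of eigenvalues from PCAs on same-size random data."""
    rng = np.random.default_rng(seed)
    eigs = np.empty((n_sims, n_items))
    for s in range(n_sims):
        eigs[s] = pca_eigenvalues(rng.normal(size=(n_persons, n_items)))
    return np.percentile(eigs, 95, axis=0)

# Simulated data with one common factor (assumption: loading 0.6 on all items).
rng = np.random.default_rng(42)
n_persons, n_items = 300, 10
factor = rng.normal(size=(n_persons, 1))
noise = rng.normal(size=(n_persons, n_items))
data = 0.6 * factor + np.sqrt(1 - 0.6 ** 2) * noise

observed = pca_eigenvalues(data)
shadow95 = random_shadow(n_persons, n_items)

# Count leading components that rise above the random shadow.
n_retained = 0
for obs, cut in zip(observed, shadow95):
    if obs <= cut:
        break
    n_retained += 1

print("observed:", np.round(observed, 2))
print("shadow  :", np.round(shadow95, 2))
print("components above chance:", n_retained)
```

Plotting `observed` and `shadow95` on the same scree plot gives the visual comparison the slide describes.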

10

“Too Short IQ” Spreadsheet

•Items with something in common contribute to a reliable total score
•Cronbach’s alpha internal consistency reliability
•Reliability increases with high item-total correlations
•Reliability increases with test length

Item     Mean   Kurtosis   r(Item-Total)
Item01   0.06     11.16        0.30
Item02   0.22     -0.11        0.53
Item03   0.35     -1.61        0.53
Item04   0.39     -1.82        0.43
Item05   0.45     -1.96        0.53
Item06   0.49     -2.01        0.60
Item07   0.54     -1.99        0.59
Item08   0.58     -1.91        0.36
Item09   0.77     -0.29        0.49
Item10   0.86      2.60        0.39
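The link between test length and reliability is usually estimated with the Spearman-Brown formula. A small sketch follows; the starting reliability of 0.70 and the target of 0.90 are hypothetical numbers chosen for illustration, not values from the TS-IQ data.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test is lengthened by length_factor."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)

def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach the target reliability."""
    return target * (1.0 - reliability) / (reliability * (1.0 - target))

current = 0.70   # hypothetical reliability of a 10-item test
target = 0.90    # hypothetical goal

factor = length_factor_needed(current, target)
print(f"Lengthen the test by a factor of {factor:.2f}")
print(f"That is about {round(10 * factor)} comparable items")
print(f"Check: predicted reliability = {spearman_brown(current, factor):.2f}")
```

The formula assumes the added items are comparable in quality to the existing ones, so it is a guesstimate rather than a guarantee.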

11

How to Identify Reliable Items 2

(Methods are listed from informal to formal.)

Classical test theory: enough for one-shot ad hoc indices
• Floors or ceilings restrict variance
• Look at a PCA
• To increase Cronbach’s alpha, avoid low item-total correlations
• Guesstimate test length with the Spearman-Brown formula

Factor analysis (confirmatory if at all possible)
• See how well a 1-factor confirmatory model fits
• Factorial “validity”: does the factor structure fit theory?

Rasch (IRT) modeling
• Pick items that fit a carefully considered measurement model
• Consider item difficulties more deeply
• Pick items suited to the intended task

12

Confirmatory Factor Analysis (CFA)

See how well (ha! how badly!) the data fit a theory-driven model (factorial “validity”)

Theory: TS-IQ measures g, a single dimension of intelligence.

Evaluate the fit of a single factor measurement model

CFA, popular in psychology, seldom done in non-psychiatric medicine (exception: Quality of life indices have extensive psychometric analysis using all current methods)

13

“Too Short IQ” SAS CFA of single-factor measurement model

RMSEA < .05, CFI > 0.95 or 0.96 (high standards of model fit)

So far, most VU tests early in development fail to meet the high standards for measurement model fit.

SAS PROC CALIS, old-fashioned but (more or less) usable
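Both fit indices named on this slide are simple functions of the model chi-square. The sketch below uses the common textbook definitions; software differs in details (for example, some programs use N where this uses N - 1), and the chi-square values are hypothetical, not results from the TS-IQ analysis.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

# Hypothetical one-factor model for 10 items: 55 moments - 20 parameters = 35 df.
fit_rmsea = rmsea(chi2=70.0, df=35, n=500)
fit_cfi = cfi(chi2_model=70.0, df_model=35, chi2_base=1000.0, df_base=45)

print(f"RMSEA = {fit_rmsea:.3f}  (meets < .05: {fit_rmsea < 0.05})")
print(f"CFI   = {fit_cfi:.3f}  (meets > .95: {fit_cfi > 0.95})")
```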

14

Rasch or IRT Model

IRT, Item Response Theory

Rasch: One parameter logistic IRT model

Good for practical test development (converges)

Multi-parameter Item Response Theory (IRT)

2-3 parameter models (discrimination, guessing)

For measurement research

Software, e.g. R, MPLUS, Parscale, Bilog-MG, user-written procs

P = Prob of getting item i “right”

Theta = person’s ability

b = item’s difficulty on same scale
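In the standard one-parameter logistic form, these three quantities are related as follows. The subscript i on b is added here to index items; the notation is otherwise the one defined above.

```latex
P_i(\theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}}
```

When theta equals b_i the exponent is zero and the probability is exactly 0.5.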

15

Rasch Model

• “Measure score” for person and item in same units

• If your measure = item’s measure, p(right) = 50%

• If you’re better than the item, p(right) > 50%

• 1-parameter logistic model (1PLM)

As (Person – Item) increases, prob(correct) increases along the logistic curve.
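A few lines of Python make the relationship concrete. This is a minimal sketch of the 1PLM response function; the function name `rasch_probability` and the sample values are illustrative assumptions.

```python
import math

def rasch_probability(person, item):
    """P(correct) under the Rasch model; both measures are in logits."""
    return 1.0 / (1.0 + math.exp(-(person - item)))

# Probability of a correct answer as (person - item) grows.
for diff in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p = rasch_probability(diff, 0.0)
    print(f"person - item = {diff:+.1f}  ->  p(right) = {p:.2f}")
```

Only the difference between the person and item measures matters, which is what puts people and items on the same scale.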

16

Rasch (1960/1980) model

Simple 1PLM, can use conventional total score or table lookup

Parallel logistic curves for items

Good for practical test construction (WINSTEPS)

Software in development > 20 years

IRT 2PLM, 3PLM may be better for certain kinds of measurement research

Rasch, G. (1960/1980). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: University of Chicago Press.

Statistician, Danish student of RA Fisher

Rasch Model: TS-IQ Items Cover a Range of Difficulties

17

“Too Short IQ” Items Information Spread Across Whole Range

Easy items, like #10, are most informative about low scoring individuals

Hard items, like #1, are most informative about high scoring individuals.

This test’s items spread to describe whole range of IQs
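Under the Rasch model, an item's information at ability theta is P(1 - P), which peaks where theta equals the item's difficulty. The sketch below illustrates why easy items inform about low scorers and hard items about high scorers. The difficulties of -2 and +2 logits are hypothetical stand-ins, not the estimated TS-IQ values.

```python
import math

def rasch_probability(theta, b):
    """P(correct) for ability theta and item difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta: P * (1 - P)."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)

EASY_B = -2.0   # hypothetical difficulty of an easy item
HARD_B = 2.0    # hypothetical difficulty of a hard item

for theta in (-2.0, 0.0, 2.0):
    easy = item_information(theta, EASY_B)
    hard = item_information(theta, HARD_B)
    print(f"theta = {theta:+.1f}  easy item info = {easy:.3f}  hard item info = {hard:.3f}")
```

Summing item information across all items gives the test information curve, so items spread across the difficulty range yield information across the whole ability range.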

18

IRT: Compare Items with People

Clinically Targeted Test (VUMC Greco)

• Items gray, people black
• School sample
• High is bad (sicker)
• Clinical screens focus on sick people
• Classify: treat yes-no
• Job is to be maximally informative at the cutpoint
• This test invests its items in severe range

Greco, L. A., Lambert, W., & Baer, R. A. (2008). Psychological inflexibility in childhood and adolescence: Development and evaluation of the Avoidance and Fusion Questionnaire for Youth. Psychological Assessment, 20(2), 93-102.

19

Acorn 10 item scale

•Left, distribution of children (each # = 3)
•Right, distribution of items
•Centerline, measure score, theta for people and “difficulty” for items
•Self-harm item, a severe outlier
•9 Items concentrated in low-average range
•Are they concentrated near the clinical-normal threshold?

20

Putting It All Together: Too Short-IQ’s Items and Total

21

Putting it all together (Walker’s CSI)

Multiple criteria converge => firm conclusion without definitive cutoffs or perfect models

Items scored 0-4

Items 1-35

Walker, Lynn S., Beck, Joy E., Garber, Judy, & Lambert, Warren. The Children’s Somatization Inventory: Psychometric Properties of the Revised Form (CSI-24) and Evidence for a Continuum of Symptom Reporting in Youth. In press, J. Pediatric Psychology.

22

Bold items, some concern

Self-harm, having a low mean, shows some roughness (no fatal flaws).

Infit/outfit flags are borderline. The “good” range is now 0.7-1.3; it used to be 0.5-1.5.

A&D items are near the floor, but still seem to work.
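Infit and outfit are mean-square statistics built from the residuals between observed responses and Rasch-model probabilities. The sketch below computes them for simulated 0/1 data and flags items outside 0.7-1.3. It assumes the person and item measures are known (real software estimates them first), and the deliberately noisy last item, the sample size, and the function name are illustrative assumptions.

```python
import numpy as np

def rasch_fit_statistics(responses, theta, b):
    """Infit and outfit mean squares per item for 0/1 responses."""
    responses = np.asarray(responses, dtype=float)
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(b, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    w = p * (1.0 - p)                               # model variance per response
    sq_resid = (responses - p) ** 2
    infit = sq_resid.sum(axis=0) / w.sum(axis=0)    # information-weighted
    outfit = (sq_resid / w).mean(axis=0)            # unweighted, outlier-sensitive
    return infit, outfit

rng = np.random.default_rng(1)
theta = rng.normal(size=2000)
b = np.array([-1.5, -0.75, 0.0, 0.75, 1.5, 0.0])
p_true = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
responses = (rng.random(p_true.shape) < p_true).astype(int)

# Replace the last item with coin flips so it no longer fits the model.
responses[:, -1] = rng.integers(0, 2, size=theta.size)

infit, outfit = rasch_fit_statistics(responses, theta, b)
flagged = [j for j, v in enumerate(infit) if not 0.7 <= v <= 1.3]

for j, (i_ms, o_ms) in enumerate(zip(infit, outfit)):
    print(f"item {j}: infit = {i_ms:.2f}  outfit = {o_ms:.2f}")
print("flagged items:", flagged)
```

Values near 1.0 mean the item behaves as the model expects; values well above 1.0 indicate noise the model does not explain.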


23

•10 item scale has excellent overall stats

•Even fits a one-factor model with fit indices good enough for Psych Assessment purists.

•3 item scale has some problems as a reliable psychological test

•May be too short to act as a scale with a reliable sum score

•A set of 3 warning flags?

