Choose the best items: A basic psychometric toolkit for testmakers Warren Lambert Vanderbilt Kennedy Center February 2007


Choose the best items: A basic psychometric toolkit for testmakers

 

Warren Lambert, Vanderbilt Kennedy Center

February 2007


Examples of Recent Test Development by KC Investigators

Peabody

Two different tests of school-based reading ability

A test of school-based math skills

Very early signs of autism spectrum in infants

A battery of 10 new tests for tracking mental health treatment of children

VUMC

Somatizing in children with recurrent abdominal pain

Survey of attending MD satisfaction with a department in hospital

Psychological rigidity in children

Goal of Today’s Session

Provide tools for people making their first index or test.


What Is a “Test”?

Could be a questionnaire

A set of items in a structured interview

Signs & symptoms of something

Often a “fuzzy” construct with numerous imperfect indicators, e.g. Beck Depression Inventory, SF-36, CBCL

Tests gain reliability by combining imperfect items into a total score. The sum of items will be more reliable than any single item.

A test is a set of items that produces a total score


How to Identify the Best Items: A toolkit, not an analytic plan

Flag weaker items to drop or revise

Identify the weaker items

Relative, not absolute criteria

Classical test theory: enough for most medical research

Floors or ceilings restrict variance

To increase Cronbach’s alpha, avoid low item-total correlations

Guesstimate test length with Spearman-Brown formula

Factor analysis (exploratory and confirmatory)

Are there items that don’t fit the construct?

Avoid items that do not load on the main factor

See how well a confirmatory model fits

Rasch modeling

Pick items that fit a carefully considered measurement model

Consider item difficulties more deeply

Pick items that suit the intended task

[Scale alongside the list: Informal (top) to Formal (bottom)]


Psychometrics vs Statistics

Statistics:

Find a statistical model that fits your data

Psychometric test construction:

Find data that fits your statistical model

Choose sound measurement models and pick items that fit by dropping weaker items.


Classical Test Theory (CTT)

Basic description of items

Can be done with SAS, SPSS, Stata, S-PLUS, R, etc.

Do this routinely with scales old and new

Informal test development, e.g., a one-shot ad hoc index for an article

Not enough for tests that will be widely used in many settings


Note Floors or Ceilings: The “Too Short” IQ Test (TS-IQ)

Extreme means and low SD or variance all indicate floors or ceilings, but high kurtosis is very easy to spot.

The “Too Short” IQ Test data set, with SAS and SPSS code, is available for download: http://kc.vanderbilt.edu/quant/Seminar/schedule.htm


Hard, Medium, & Easy Items: #1, #6, #10

[Figure: proportion of Wrong vs. Right responses (0.0 to 1.0) for items #1, #6, and #10. Kurtosis: 11 for #1 (floor), -2 for #6, 3 for #10 (ceiling).]

Measuring an entire population requires a range of item difficulties. If everyone has the same score, the item gives no information.


Use Excel Conditional Formatting to Flag Problems


Retain Flagged Estimates of Quality: “Too Short IQ Test” (TS-IQ)

Variable   Mean   Kurtosis
Item01     0.06      11.16
Item02     0.22      -0.11
Item03     0.35      -1.61
Item04     0.39      -1.82
Item05     0.45      -1.96
Item06     0.49      -2.01
Item07     0.54      -1.99
Item08     0.58      -1.91
Item09     0.77      -0.29
Item10     0.86       2.60
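The same screening step can be sketched in Python for readers without SAS or SPSS. This is only an illustration: the function names, the made-up responses, and the cutoff of 2.0 are assumptions, not values from the talk. It computes population excess kurtosis, so results will differ slightly from the bias-corrected estimates that SAS and SPSS print.

```python
def item_mean(scores):
    """Proportion answering correctly when an item is scored 0/1."""
    return sum(scores) / len(scores)


def excess_kurtosis(scores):
    """Population excess kurtosis: m4 / m2**2 - 3 (a normal curve gives 0)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    if m2 == 0:
        return float("inf")  # everyone gave the same answer: no information
    return m4 / m2 ** 2 - 3


def flag_floor_or_ceiling(scores, cutoff=2.0):
    """Flag an item whose kurtosis exceeds a cutoff (2.0 is a hypothetical choice)."""
    return excess_kurtosis(scores) > cutoff


# Hypothetical data: a very hard item (6% right) and a medium item (50% right).
hard_item = [1] * 6 + [0] * 94
medium_item = [1] * 50 + [0] * 50
for name, item in [("hard", hard_item), ("medium", medium_item)]:
    print(name, round(item_mean(item), 2), round(excess_kurtosis(item), 2),
          flag_floor_or_ceiling(item))
```

As in the table, the item near the floor stands out with a large positive kurtosis, while the 50/50 item sits near -2.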


Item-total Correlations: How can you add unrelated things into a single total?

If an item is uncorrelated with other items, it doesn’t contribute to the internal-consistency reliability of the total score

Software packages such as SAS and SPSS compute item-total correlations very easily

Good check to use routinely
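For readers who want to see what the packages compute, here is a minimal Python sketch of the "corrected" item-total correlation, in which each item is correlated with the total of the remaining items. The function names and the six-person data set are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_total(rows):
    """rows holds one list of item scores per person; returns one r per item."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]  # total of the other items
        result.append(pearson_r(item, rest))
    return result


# Hypothetical responses from six people to four 0/1 items.
rows = [
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
]
for number, r in enumerate(corrected_item_total(rows), start=1):
    print(f"Item{number:02d} r(item-total) = {r:.2f}")
```

In this made-up example the fourth item is unrelated to the others, so its item-total correlation is the one to flag.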


Biological Age Index (Frailty): Negative Item-Total Correlations Are Bad

Left column: forgot to “flip” items

Correlation with Total   High is   Label                            Correlation with Total
                  0.57   Good      Feet Walked In Six Minutes                         0.68
                  0.42   Good      Rank For Variable Foot                             0.59
                  0.30   Good      Times Weight Lifting                               0.54
                 -0.40   Bad       Seconds Trail B                                    0.44
                  0.42   Good      Standing Forward Bend                              0.41
                  0.38   Good      Tinetti Balance Score                              0.38
                 -0.30   Bad       GDS Depression (High=Sad)                          0.38
                  0.20   Good      Sum of 3 Exercise Measures                         0.35
                 -0.20   Bad       Charlson Comorbidity Index                         0.32
                 -0.12   Bad       Mini Mental State Examination                      0.15
                 -0.09   Bad       Body Mass Index                                    0.02

Make sure all items are high-is-good or high-is-bad
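Flipping an item is a one-line transformation. The sketch below is a hypothetical helper, not code from the talk: it reverse-scores a scale so that the lowest possible score becomes the highest and vice versa. The 0-to-15 range in the example is an assumption made only for illustration.

```python
def reverse_score(values, low, high):
    """Flip a scale so high and low swap places: new = low + high - old."""
    return [low + high - v for v in values]


# Hypothetical high-is-bad scores on an assumed 0-to-15 scale.
high_is_bad = [2, 11, 0, 15, 7]
high_is_good = reverse_score(high_is_bad, low=0, high=15)
print(high_is_good)
```

Reverse-scoring changes the sign of an item's correlation with every other variable, which is why unflipped items show up as negative item-total correlations.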

Goffaux, J., Friesinger, G. C., Lambert, E. W., et al. (2005). Biological age--a concept whose time has come: a preliminary study. South Med J, 98(10), 985-993.


“TS-IQ”: Low Item-Total r’s Are Bad

SPSS Reliability or SAS PROC CORR ALPHA

SPSS Reliability


“Too Short” Item-Total Correlations

• Items with nothing in common would not have a reliable total score

• Cronbach’s alpha measures internal-consistency reliability

• Reliability increases with high item-total correlations

Item     Mean   Kurtosis   r(Item-Total)
Item01   0.06      11.16            0.30
Item02   0.22      -0.11            0.53
Item03   0.35      -1.61            0.53
Item04   0.39      -1.82            0.43
Item05   0.45      -1.96            0.53
Item06   0.49      -2.01            0.60
Item07   0.54      -1.99            0.59
Item08   0.58      -1.91            0.36
Item09   0.77      -0.29            0.49
Item10   0.86       2.60            0.39
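To make the link between weak items and alpha concrete, here is a small Python sketch of Cronbach's alpha using the standard formula, alpha = k/(k-1) × (1 - sum of item variances / variance of the total). The data are hypothetical (six people, four 0/1 items, the last item unrelated to the rest) and the function names are illustrative.

```python
def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(rows):
    """rows holds one list of item scores per person."""
    k = len(rows[0])
    item_variances = [variance([r[j] for r in rows]) for j in range(k)]
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses; the fourth item has little in common with the others.
rows = [
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
]
alpha_all = cronbach_alpha(rows)
alpha_without_item4 = cronbach_alpha([r[:3] for r in rows])
print(round(alpha_all, 2), round(alpha_without_item4, 2))
```

With these made-up numbers, dropping the unrelated item raises alpha even though the test gets shorter.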


How Many Items? Spearman-Brown’s Predicted Reliability = F(N Items)

[Figure: predicted reliability (0.0 to 1.0) plotted against N items (5 to 35)]

Classical Test Theory: Reliability increases with the number of items

Put the S-B formula into Excel to see approximately how many items you need for desired reliability under CTT.
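The same calculation works anywhere, not only in Excel. The Spearman-Brown formula predicts the reliability of a test lengthened by a factor n as n·r / (1 + (n - 1)·r), where r is the current reliability. The sketch below uses hypothetical numbers (a 10-item test with reliability 0.70) and illustrative function names.

```python
import math


def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test is length_factor times as long."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))


# Hypothetical starting point: 10 items with reliability 0.70.
current_items, current_reliability = 10, 0.70
doubled = spearman_brown(current_reliability, 2)
items_for_90 = math.ceil(current_items * length_factor_needed(current_reliability, 0.90))
print(round(doubled, 2), items_for_90)
```

The prediction assumes the added items are as good as the existing ones, so treat the answer as a guesstimate.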

Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.

Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271-295.


How Much is Enough?

• For local use of an ad hoc research index, CTT may suffice

• Formal tests (available for general use) require more thorough psychometric analysis

• Factor analysis and Item Response Theory modeling


Factor Analysis (FA): Beginning formal test development

Goal is to make sure the test’s theoretical foundations agree with the test data

We want to produce one or more single-factor tests

Use EFA (exploratory factor analysis) and

CFA (confirmatory factor analysis)


Scree Plot of “TS-IQ”

• Run a principal components analysis with SAS, SPSS, etc.

• “Scree” plot of eigenvalues

• Cattell’s metaphor: a mountain rising above useless rubble

• Is there more than one big component?

• Hard to get multiple factors (subtests) from the “Too Short” IQ Test

• The Kaiser criterion (minimum eigenvalue > 1) is extremely liberal and produces unstable factors

[Figure: scree plot of eigenvalue (0 to 5) against component number (1 to 10); the first three components are labeled 1, 2, and 3]
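The numbers behind a scree plot are the eigenvalues of the item correlation matrix. The pure-Python sketch below is an illustration only: it approximates them by power iteration with deflation, a simple stand-in for the routines in SAS or SPSS, and the function names and the small example matrix are hypothetical.

```python
import math
import random


def correlation_matrix(rows):
    """Pearson correlations among items; rows holds one list of scores per person.
    Assumes every item varies (a constant item would divide by zero)."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    dev = [[r[j] - means[j] for j in range(k)] for r in rows]
    ss = [sum(d[j] ** 2 for d in dev) for j in range(k)]
    return [[sum(d[a] * d[b] for d in dev) / math.sqrt(ss[a] * ss[b])
             for b in range(k)] for a in range(k)]


def eigenvalues_descending(matrix, iterations=1000, seed=0):
    """Approximate eigenvalues of a correlation matrix, largest first,
    by power iteration with deflation (adequate for a rough scree check)."""
    rng = random.Random(seed)
    k = len(matrix)
    a = [row[:] for row in matrix]
    values = []
    for _component in range(k):
        v = [rng.uniform(-1, 1) for _ in range(k)]  # random start vector
        for _step in range(iterations):
            w = [sum(a[i][j] * v[j] for j in range(k)) for i in range(k)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(k) for j in range(k))
        values.append(lam)
        for i in range(k):  # deflate: remove the component just found
            for j in range(k):
                a[i][j] -= lam * v[i] * v[j]
    return sorted(values, reverse=True)


# Hypothetical correlation matrix: three items that all correlate 0.5.
example = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
eigenvalues = eigenvalues_descending(example)
print([round(e, 2) for e in eigenvalues])
print("first / second:", round(eigenvalues[0] / eigenvalues[1], 1))
```

One large eigenvalue followed by a flat run of small ones is the "mountain above rubble" pattern that suggests a single dominant component.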


Formal Test Construction: VUMC Pediatric Researcher’s Three Samples

1. N = 181 children rating understandability of items

2. N = 513 Psychometric sample 1

3. Psychometric sample 2, N = 675

2a. Random 50% N=346 exploratory sample (CTT, EFA)

2b. Random 50% confirmatory sample (CTT, CFA)
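A random 50% split like the one in 2a and 2b can be sketched as follows. The function name, the seed, and the 100 hypothetical participant IDs are assumptions for illustration; fixing the seed simply makes the split reproducible.

```python
import random


def split_half(person_ids, seed=2007):
    """Randomly assign people to an exploratory half and a confirmatory half."""
    ids = list(person_ids)
    random.Random(seed).shuffle(ids)
    middle = len(ids) // 2
    return ids[:middle], ids[middle:]


# Hypothetical participant IDs 1 to 100.
exploratory, confirmatory = split_half(range(1, 101))
print(len(exploratory), len(confirmatory))
```

Exploratory work (CTT, EFA) is then done on the first half only, and the resulting model is tested once on the untouched second half.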


Confirmatory Factor Analysis

See how well (ha! how badly!) the data fit a theory-driven model: “factorial validity”

Theory: TS-IQ measures a single dimension of intelligence.

Run a measurement model

Look at fit indices

Very popular in psychology, rarely done in nonpsychiatric medicine (exception: SF-36 has extensive psychometric analysis)


“Too Short IQ” SAS CFA of single-factor measurement model

RMSEA < .05, CFI > 0.95 or 0.96 (very high standards of unidimensionality)

Warning: So far, most VU tests early in their development haven’t met the high standards for measurement model fit.
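For readers who want to see where these two indices come from, the sketch below applies the usual textbook formulas to chi-square output from a CFA. All numbers are hypothetical, the function names are illustrative, and some programs use N rather than N - 1 in RMSEA, so printed values can differ slightly from package output.

```python
import math


def rmsea(chi_square, df, n):
    """Root mean square error of approximation from the model chi-square."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


def cfi(chi_square_model, df_model, chi_square_baseline, df_baseline):
    """Comparative fit index: model misfit relative to a baseline
    (independence) model."""
    model_excess = max(chi_square_model - df_model, 0.0)
    baseline_excess = max(chi_square_baseline - df_baseline, model_excess)
    if baseline_excess == 0:
        return 1.0
    return 1.0 - model_excess / baseline_excess


# Hypothetical single-factor model for 10 items: df = 35, baseline df = 45.
fit_rmsea = rmsea(chi_square=70.0, df=35, n=501)
fit_cfi = cfi(70.0, 35, 1000.0, 45)
print(round(fit_rmsea, 3), round(fit_cfi, 3))
print("meets cutoffs:", fit_rmsea < 0.05 and fit_cfi > 0.95)
```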


Run Rasch or IRT (Item Response Theory)

Rasch: One parameter logistic model

Good for practical test development (converges)

E.g. Winsteps ($100 or $200)

Item Response Theory (IRT)

1-, 2-, or 3-parameter models

Good for research

Need large samples

E.g. Parscale, Bilog-MG, Multilog ($100 VU site license)

P = probability of getting item i “right”

Theta = person’s ability

B = item’s difficulty on the same scale


Rasch Model

• “Measure score” for person and item in same units

• If you’re better than the item, p (right) > 50%

• One-parameter logistic model

As (Person - Item) increases, prob(right) increases along a logistic curve.
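In symbols, the one-parameter logistic model is P = exp(theta - b) / (1 + exp(theta - b)), with theta the person's measure and b the item's difficulty. A minimal Python sketch follows; the function name and the example values are illustrative.

```python
import math


def rasch_probability(theta, b):
    """P(right) under the one-parameter logistic (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical item of difficulty 0 answered by people of varying ability.
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(rasch_probability(theta, b=0.0), 2))
```

When the person and the item have the same measure the probability is exactly 50%, and it rises above 50% as soon as the person is "better than the item."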


Rasch Model: Items spread over a range of difficulties

[Figure from http://en.wikipedia.org/wiki/Rasch_model, with labels running from “Easy items” to “Hard items”]


WINSTEPS: One-parameter Rasch program (see http://www.winsteps.com)

$200 ($99 on summer sale)


“TS-IQ” Item Information Spread Across Whole Range

Easy items, like #10, are most informative about low-scoring individuals

Hard items, like #1, are most informative about high-scoring individuals.

This test’s items are spread out to describe the whole range of IQs
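Under the Rasch model, the information an item provides at ability theta is P × (1 - P), which peaks where the person's ability equals the item's difficulty. The sketch below illustrates this with one hypothetical easy item and one hypothetical hard item; the difficulties of -2 and +2 are assumptions, not estimates from the TS-IQ.

```python
import math


def rasch_probability(theta, b):
    """P(right) under the one-parameter logistic (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Rasch item information at ability theta: P * (1 - P)."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)


easy_b, hard_b = -2.0, 2.0  # hypothetical difficulties
for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(item_information(theta, easy_b), 3),
          round(item_information(theta, hard_b), 3))
```

The easy item carries most of its information at the low end of the scale and the hard item at the high end, which is why a test meant to measure everyone needs both.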


Persons & Items on One Scale

Rasch model measures each item and each person on the same scale

Concentrate your items where they are needed

Measure everyone

Measure high clinical cases most efficiently

TS-IQ measures across a wide range


VUMC Clinical Test Focuses on Cutpoint: Unlike the TS-IQ

• School sample

• High is bad (sicker)

• Clinical screens focus on sick people

• Classify: treat yes-no

• Job is to be maximally informative at the cutpoint

• This test invests its items in the severe range


Putting It All Together: TS-IQ’s Items and Total


Putting It All Together: VUMC Pediatrics

Items go 0-4

Many items near the floor (≤ 1)

The lowest few have excessive kurtosis

However, many item-total r’s and Rasch fit statistics are OK

Test maker can shorten this with considerable latitude, e.g. with content analysis.


Putting It All Together

Test has one odd item that measures something else.

Drop or revise that item.


How to Identify the Best Items: A toolkit, not an analytic plan

Flag weaker items to drop or revise

Identify the weaker items

Relative, not absolute criteria

Classical test theory: enough for most medical research

Floors or ceilings restrict variance

To increase Cronbach’s alpha, avoid low item-total correlations

Guesstimate test length with Spearman-Brown formula

Factor analysis (exploratory and confirmatory)

Are there items that don’t fit the construct?

Avoid items that do not load on the main factor

See how well a confirmatory model fits

Rasch modeling

Pick items that fit a carefully considered measurement model

Consider item difficulties more deeply

Pick items that suit the intended task

[Scale alongside the list: Informal (top) to Formal (bottom)]