How does health psychology measure up?

A critical look at measurement in health psychology

Matthew Hankins, 16th September 2011

The empirical basis of Health Psychology

• Why do Health Psychologists collect data?

– Theory generation, esp. identifying constructs

– Theory corroboration

– Measuring outcomes (trials etc.)

• The value of such activities is therefore critically dependent on the quality of the data

Questionnaire measures

• The majority of data collected by Health Psychologists is generated by questionnaire measures (‘scales’)

• Questionnaires vary in the quality of data that they generate

– Validity: extent to which the questionnaire measures what is intended

– Reliability: extent to which variance in data reflects variance in construct measured

• Index of measurement error

Pragmatic approach

• Validity

– Unidimensionality (factor analysis)

– Associations between measures

– Discrimination between known groups

• Reliability

– Estimated by Cronbach’s alpha

– Or test-retest correlation
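As a rough illustration of the reliability estimate named above, the following Python sketch computes Cronbach's alpha from a respondents-by-items score matrix. The function name `cronbach_alpha` and the demo data are assumptions made for illustration; they do not come from the talk.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a 2-D array: rows = respondents, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical responses from five people to three items
demo = [[0, 1, 1], [1, 1, 2], [2, 2, 3], [3, 3, 3], [0, 0, 1]]
print(round(cronbach_alpha(demo), 2))
```

Note that alpha depends only on item and total-score variances, so it says nothing about whether the summed score actually orders people on the construct.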

Scale development

• Combination of these approaches is derived from ‘Classical Test Theory’ (CTT)

– Originated with Spearman (1904)

– Landmark text: Guilford 2nd ed. (1954)

– Fully developed by Lord & Novick (1968)

• Further developments: ‘item-response theory’ (IRT)

– E.g. Rasch model (1960)

• CTT implicit in most empirical Health Psychology research

What is a scale?

• A scale orders people on the construct of interest

• Both CTT & IRT agree that a person’s position on the dimension can be estimated from the item scores

• Strength of IRT is that it does not assume that a set of correlated items forms a scale

• Implicit in CTT: if items load on same factor, we automatically assume that they form a scale

[Figure: Persons A, B, C and D ordered from low to high on the construct]

Scaling problem

• Whether a set of items forms a scale is a hypothesis (Guttman 1950)

– Formally tested whether items formed ‘Guttman scales’

• “In contemporary psychometric practice, it is the rule rather than the exception that two people having the same score on a test will have [endorsed] different items… Such scores are crude empirical devices known to have some predictive efficiency, but they cannot be called measurements in any strict sense” (Loevinger 1948)

• Additionally, there is no rational basis for adding up a set of ordinal Likert scores unless they have been shown to scale

Example: PHQ-9

• Feeling tired + Little interest in doing things + Poor appetite, each on several days in last 2 weeks

– Scale score = +3

• Thoughts of hurting yourself in some way nearly every day in last 2 weeks

– Scale score = +3

• Are these responses really equivalent?

Implications

• If a set of items is merely assumed to form a scale, then we cannot be sure that the scale score accurately ranks people on the construct of interest

– People with different positions may be assigned the same score

– People with the same position may be assigned different scores

• Unless we test this hypothesis, assessing reliability & validity is pointless

Rejecting the hypothesis of a scale

• Scales are very rarely ‘rejected’ in health psychology

• Reliability is usually reported as ‘acceptable’ or ‘good’

– Based on arbitrary cut-off around 0.7 (0.6, 0.5…)

– “Test-retest reliability was acceptable (r=0.43)”

• Criteria for validity are usually not specified in advance

– Any factor structure can be accommodated

– Any association can be cited as ‘validating’ scale

• Formal testing of ‘scalability’ of items rare


What we would like: interval scales

What we might have: ordinal scales

What we probably have: disordered categories

A scale that cannot rank-order people is not a scale

Item ‘difficulty’ (intensity)

• The problem arises because CTT does not account for item difficulty or intensity

• Some items are endorsed at low levels of the construct

– ‘Low intensity item’

– Endorsement may indicate low or high level of construct

• Some items are endorsed only at high levels of the construct

– ‘High intensity item’

– Endorsement indicates high level of construct

Example: PHQ-9

• Feeling tired on several days is a low intensity item

– Endorsed at low level of depression

– But may also be endorsed at higher levels of depression

Depression

Low Yes Yes Yes Yes High

Example: PHQ-9

• Thoughts of hurting yourself in some way nearly every day in last 2 weeks is a high intensity item

– Endorsed at high level of depression

– But not endorsed at lower levels of depression

Depression

Low No No No Yes High

How CTT fails to deal with item intensity

Factor analysis groups items of similar intensity

• Factor analysis of a unidimensional construct will produce more than one ‘factor’

• These ‘factors’ are simply sets of items with similar intensities
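One way to see why this can happen, sketched under the assumption that the factor analysis is run on ordinary Pearson (phi) correlations between yes/no items: even two items that form a perfect Guttman pair cannot correlate at 1.0 unless they are endorsed equally often. The function name `guttman_phi` and the endorsement rates are illustrative assumptions.

```python
import math

def guttman_phi(p_easy, p_hard):
    """Phi correlation between two dichotomous items that scale perfectly.

    p_easy: proportion endorsing the less intense item
    p_hard: proportion endorsing the more intense item (p_hard <= p_easy)
    Assumes no Guttman errors: everyone endorsing the hard item
    also endorses the easy one.
    """
    return math.sqrt(p_hard * (1 - p_easy) / (p_easy * (1 - p_hard)))

# Similar intensities: correlation stays high
print(round(guttman_phi(0.8, 0.7), 2))
# Very different intensities: correlation drops, with zero Guttman errors
print(round(guttman_phi(0.8, 0.2), 2))
```

Because correlations shrink as intensities diverge, a method that clusters items by correlation will tend to cluster them by intensity, even when a single construct underlies all of them.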

Example: GHQ-12

• Many studies report 2- or 3-factor solutions

• ‘Factors’ simply group items by intensity (Hankins 2008)

[Figure: the 12 GHQ-12 items positioned by intensity along the psychiatric morbidity continuum, from low to high]

How CTT fails to deal with item intensity

Selecting items on the basis of factor analysis exacerbates the problem, but simultaneously conceals it

• Items are selected on basis of similar intensities, creating scales with limited range but high reliability

[Figure: item positions from low to high for the full GHQ-12 and for a selected subset of its items]


Why Rasch modelling is not the answer

• Rasch modelling (RM) explicitly takes into account item intensities

– Stochastic Guttman scale

• Tests the hypothesis that items form a scale

• Additionally claims to produce interval scaling & ‘objective’ measurement

• Increasingly popular in Health Psychology

CTT vs. IRT

• Argument tends to be that IRT is superior to CTT & IRT is ‘objective’ measurement

• Differences more apparent than real:

– Large correlations between CTT data & IRT data

– If data treated as ordinal, perfect correlation between CTT & Rasch data

From Embretson & Reise (2000)

[Figure: GHQ-12: CTT scoring vs. RM scoring]

Problems

• Rasch models require very large samples to allow estimation of person and item parameters

• Very strong assumptions, e.g. logistic item-response curve

– Why should all items have the same form of response?

• The data must fit the model, not the other way round

– Discards potentially useful data to fit arbitrary assumptions

• Interval scaling is questionable gain if psychological constructs are not quantitative in the first place

Ontological diversion

• In general, psychologists seem to believe that attributes are either categorical or quantitative

– A ‘cat’ is different from a ‘tree’: different categories, difference is qualitative

– 30 cm is different from 60 cm: different quantities, difference is quantitative

• Having made this distinction, quantitative attributes may be measured as categorical, ordinal, interval

• Ordinal attributes cannot exist in their own right

– Just a way of collecting data on a quantitative attribute

Ontological diversion

• Russell (1896): the difference between two quantities is itself a quantity

– The difference between two lengths is itself a length

• For psychological attributes to be quantitative, the difference between two ‘levels’ of that attribute must itself be a ‘level’ of that attribute

– Is the difference between two pleasures itself a pleasure?

– Is the difference between two levels of depression itself a level of depression?

• If not, are psychological states then merely categorical?

– But what then do we mean by ‘severity’ of depression?

Ontological diversion

• Is it possible for psychological attributes to be ordinal?

– Can something exist in degree but not quantity?

• Michell (2009) argues that we cannot assume quantity from degree

– Shows that they are logically separable: “It is possible that an ordered attribute is non-quantitative”

• Collingwood (1933) argues that some concepts exist only in degree

Ontological diversion

• Are we comfortable talking about degree, rather than quantity?

• Implicit in our descriptions and experiences of psychological attributes

– But does not require the assumption that the attributes are quantitative

The degrees of the lie

• JAQUES

– Can you nominate in order now the degrees of the lie?

• TOUCHSTONE

– O sir, we quarrel in print, by the book; as you have books for good manners: I will name you the degrees. The first, the Retort Courteous; the second, the Quip Modest; the third, the Reply Churlish; the fourth, the Reproof Valiant; the fifth, the Countercheque Quarrelsome; the sixth, the Lie with Circumstance; the seventh, the Lie Direct.

• As You Like It, Act 5 Scene 4

Summary

• Measurement methods in health psychology are suboptimal

• In particular, the fundamental assumption that correlated items form a scale is not routinely tested

• IRT models such as the Rasch model assume that interval scaling is meaningful

• Psychological attributes may not exist as quantities

• Is there a method for constructing purely ordinal scales?

Non-parametric IRT (NPIRT)

• E.g. Mokken (1971)

• Takes into account item intensities

– Stochastic Guttman scale

• Claims only to rank order people

• Very weak assumptions

– Retains data

• Complements CTT

– Uses simple scale score

Examples of NPIRT analysis

• Mokken (1971) proposed two models

– Monotone homogeneity model (MH)

– Doubly monotone model (DM)

• Scales fitting the MH model rank order people on the attribute of interest

• Corollary is that scales not fitting the MH model do not rank order people on the attribute of interest

• Select items for the scale based on homogeneity

• Assess whether the resulting scale fits the MH model

• Scaling procedure and the MH model based on the following minimal assumptions:

– For all items, if person A has a higher degree of X than person B, A’s probability of endorsing an item will be equal to or higher than B’s

– Local independence: item scores are uncorrelated for the same degree of attribute

• If the purpose of the scale is to rank order people on a given attribute then the scale must be monotone homogeneous

• Probability of item being endorsed must be monotone nondecreasing against attribute

• i.e. probability of item endorsement does not decrease with an increase in the measured attribute

[Figure: probability of item endorsement plotted against level of the attribute, the attribute being estimated from the remaining items of the scale]

For this GHQ-12 item the probability of endorsement reaches 50% at a low level of psychological distress.

It is therefore a low intensity item: people endorsing this item are signalling a low level of distress.

For this GHQ-12 item the probability of endorsement reaches 50% at a high level of psychological distress.

It is therefore a high intensity item: people endorsing this item are signalling a high level of distress.
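The monotonicity requirement described above can be checked empirically along these lines. This is a simplified Python sketch, not the procedure used in the talk: it groups people by their score on the remaining items (the "rest score") and looks at the proportion endorsing the item in each group. The function names and the `tolerance` value are assumptions, and real software typically also merges very small rest-score groups, which is skipped here.

```python
import numpy as np

def endorsement_by_rest_score(data, item):
    """Proportion endorsing `item` at each level of the rest score.

    data: 0/1 array, rows = people, columns = items.
    """
    data = np.asarray(data)
    rest = data.sum(axis=1) - data[:, item]   # score on the remaining items
    levels = np.unique(rest)
    props = np.array([data[rest == r, item].mean() for r in levels])
    return levels, props

def count_violations(props, tolerance=0.03):
    """Count pairs of groups where a higher group endorses less often.

    Decreases no larger than `tolerance` are ignored (illustrative choice).
    """
    props = list(props)
    return sum(
        1
        for low in range(len(props))
        for high in range(low + 1, len(props))
        if props[low] - props[high] > tolerance
    )

# Hypothetical perfect Guttman data for three items
demo = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
levels, props = endorsement_by_rest_score(demo, item=2)
print(levels, props, count_violations(props))
```

An item whose endorsement proportion falls as the rest score rises is a candidate violation of monotone homogeneity.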

• If two items belong to a unidimensional scale, then:

– Endorsing the more intense item entails that the less intense item also be endorsed

– Endorsing the less intense item does not entail that the more intense item be endorsed

• For a Guttman scale, these are deterministic statements

• For a Mokken scale, these are probabilistic statements

• A Guttman error occurs when the more intense item is endorsed but not the less intense item

• Too many Guttman errors imply that items are not measuring the same attribute
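A minimal Python sketch of counting Guttman errors per person, assuming yes/no items whose intensity is estimated from how rarely they are endorsed in the sample. The function name `guttman_errors` and the data are hypothetical.

```python
import numpy as np

def guttman_errors(data):
    """Number of Guttman errors in each person's response pattern.

    data: 0/1 array, rows = people, columns = items.
    An error is any item pair where the more intense (less often
    endorsed) item is endorsed but the less intense item is not.
    """
    data = np.asarray(data)
    # Order columns from least intense (most endorsed) to most intense
    order = np.argsort(-data.mean(axis=0), kind="stable")
    ordered = data[:, order]
    k = ordered.shape[1]
    errors = np.zeros(len(ordered), dtype=int)
    for easy in range(k):
        for hard in range(easy + 1, k):
            errors += (ordered[:, hard] == 1) & (ordered[:, easy] == 0)
    return errors

# Fourth person endorses only the most intense item
demo = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0]]
print(guttman_errors(demo))
```

A deterministic Guttman scale allows no such errors; a Mokken scale tolerates some, provided they are rare relative to what chance would produce.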

[Figure: cross-tabulation of responses to the more intense item against the less intense item]

• This asymmetrical relationship between item pairs can be summarised with Loevinger’s H

– H is the coefficient of homogeneity between two items i and j

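• Ranges from 0.0 to 1.0 for items that fit the model (negative values can occur in data and indicate that the items do not scale together)

– 0.0 indicates no association between items

– 1.0 indicates perfect association, given the differences in item intensity

– 1.0 also indicates no Guttman errors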

• Mokken (1971) developed H for scale development

– Hij : Homogeneity of pair of items

– Hi : Homogeneity of item i with all items

– H : Homogeneity of scale
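The three coefficients can be sketched in Python for yes/no items using the usual definition of H as one minus the ratio of observed Guttman errors to the errors expected if the items were independent. The function name `homogeneity` and the data are assumptions for illustration; the sketch also assumes every item is endorsed by some but not all respondents, so that expected errors are non-zero.

```python
import numpy as np
from itertools import combinations

def homogeneity(data):
    """Return (Hij matrix, Hi vector, scale H) for a 0/1 people-by-items array."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    p = data.mean(axis=0)                      # endorsement rate per item
    observed = np.zeros((k, k))
    expected = np.zeros((k, k))
    for i, j in combinations(range(k), 2):
        # The more intense item is the one endorsed less often
        hard, easy = (i, j) if p[i] <= p[j] else (j, i)
        obs = np.sum((data[:, hard] == 1) & (data[:, easy] == 0))
        exp_count = n * p[hard] * (1 - p[easy])  # errors expected under independence
        observed[i, j] = observed[j, i] = obs
        expected[i, j] = expected[j, i] = exp_count
    with np.errstate(divide="ignore", invalid="ignore"):
        h_pair = 1 - observed / expected       # diagonal is undefined (nan)
    h_item = 1 - observed.sum(axis=1) / expected.sum(axis=1)
    h_scale = 1 - observed.sum() / expected.sum()
    return h_pair, h_item, h_scale

# Hypothetical perfect Guttman data: no errors, so every coefficient is 1
guttman = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
h_pair, h_item, h_scale = homogeneity(guttman)
print(h_item, h_scale)
```

Summing the symmetric matrices counts each pair twice in both numerator and denominator, which leaves the ratio for the scale H unchanged.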

• All Hij > 0

• Start with item pair with highest Hij

• Select third item to maximise scale H

• Proceed until H reaches threshold value c

• Produces a unidimensional scale

– c = 0.3; weak scale

– c = 0.4; medium scale

– c = 0.5; strong scale

– c = 1.0; perfect Guttman scale
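The selection steps listed above can be sketched as a greedy loop in Python. This is a simplified reading of the procedure, not Mokken's full algorithm as implemented in published software: it works on yes/no items, has no significance testing, and stops when adding the best remaining item would pull scale H below c. The names `scale_h` and `select_items` and the data are hypothetical, and each item is assumed to be endorsed by some but not all respondents.

```python
import numpy as np
from itertools import combinations

def scale_h(data):
    """Loevinger's H for the 0/1 items in the columns of `data`."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    p = data.mean(axis=0)
    observed = expected = 0.0
    for i, j in combinations(range(k), 2):
        hard, easy = (i, j) if p[i] <= p[j] else (j, i)
        observed += np.sum((data[:, hard] == 1) & (data[:, easy] == 0))
        expected += n * p[hard] * (1 - p[easy])
    return 1 - observed / expected

def select_items(data, c=0.3):
    """Greedy item selection: returns column indices of the selected scale."""
    data = np.asarray(data)
    k = data.shape[1]
    # Start with the item pair with the highest Hij
    start = max(combinations(range(k), 2),
                key=lambda pair: scale_h(data[:, list(pair)]))
    selected = list(start)
    if scale_h(data[:, selected]) < c:
        return []
    while True:
        # Candidates must have Hij > 0 with every item already selected
        candidates = [
            i for i in range(k)
            if i not in selected
            and all(scale_h(data[:, [i, s]]) > 0 for s in selected)
        ]
        if not candidates:
            break
        best = max(candidates, key=lambda i: scale_h(data[:, selected + [i]]))
        if scale_h(data[:, selected + [best]]) < c:
            break
        selected.append(best)
    return selected

# Hypothetical data: items 0-2 scale perfectly, item 3 is unrelated to them
demo = [[0, 0, 0, 1], [0, 0, 0, 0], [1, 0, 0, 1], [1, 0, 0, 0],
        [1, 1, 0, 1], [1, 1, 0, 0], [1, 1, 1, 1], [1, 1, 1, 0]]
print(select_items(demo, c=0.3))
```

Unlike selection by factor loading, this criterion rewards items that obey the expected ordering by intensity rather than items that happen to share a similar intensity.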

Results for GHQ-12

Step Item Scale H

1 p6d 0.79

1 n4d 0.79

2 n6d 0.73

3 n5d 0.68

4 n2d 0.64

5 n3d 0.61

6 p5d 0.59

7 p3d 0.57

8 p4d 0.55

9 n1d 0.53

10 p2d 0.51

11 p1d 0.50

• => the items of the GHQ-12 form a strong unidimensional scale

Monotone homogeneity model: GHQ-12

Item H #vi maxvi zmax #zsig

p1d 0.44 0 0.00 0.00 0

n1d 0.45 0 0.00 0.00 0

p2d 0.43 1 0.06 0.99 0

p3d 0.50 0 0.00 0.00 0

n2d 0.55 0 0.00 0.00 0

n3d 0.51 0 0.00 0.00 0

p4d 0.47 0 0.00 0.00 0

p5d 0.50 1 0.05 0.90 0

n4d 0.56 0 0.00 0.00 0

n5d 0.50 0 0.00 0.00 0

n6d 0.56 1 0.05 0.93 0

p6d 0.53 1 0.04 0.68 0

• Small deviations from MH model but none significant

Conclusion

• The GHQ-12 is a strongly homogeneous unidimensional scale

• Small deviations from monotone homogeneity, none significant

• The GHQ-12 summed score can rank order people by the measured attribute

• i.e. it can serve as an ordinal measure of severity of psychiatric impairment

• Compare to results of EFA/CFA studies

Example: Northwick Park dependency scale

• Item selection from pool of 16 items

Item Scale H

Q8 0.93

Q5 0.93

Q9 0.93

Q2 0.91

Q1 0.88

Q13 0.87

Q7 0.84

Q12 0.82

Q6 0.79

Q14 0.76

Q4 0.74

Q3 0.70

Q11 0.67

Q15 0.62

• 14 items form unidimensional scale

• Two items with serious violations of monotone homogeneity

Item H #vi maxvi zmax #zsig

Q3 0.45 6 0.25 2.88 4

Q11 0.32 5 0.28 3.43 2

Q3: help required using toilet (urination)

Q11: help required with drinking

• Some items decrease in probability as attribute increases

• With extreme dependency, patients require less help with drinking and emptying bladder

– Because at this extreme, they are more likely to be tube-fed and catheterised

• Hence, for these items, probability of endorsement decreases as dependency increases

– Scale is not monotone homogeneous

• The summed score will not rank order people on the measured attribute

Summary

• The credibility of Health Psychology research & practice rests on its empirical evidence base

• This evidence base relies on the quality of questionnaire data

• The quality of questionnaire data may be compromised by the use of inappropriate methods

• We should stop relying on factor analysis & reliability coefficients & test the hypothesis that a set of items constitutes a scale
