Chapter 4 – Reliability
1. Observed Scores and True Scores
2. Error
3. How We Deal with Sources of Error:
A. Domain sampling – test items
B. Time sampling – test occasions
C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability
Chapter 4 - Reliability
• Measurement of human ability and knowledge is challenging because:
ability is not directly observable – we infer ability from behavior
all behaviors are influenced by many variables, only a few of which matter to us
Reliability – the basics
1. A true score on a test does not change with repeated testing
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).
Reliability – the basics
• Because errors are random, if we test one person many times, the errors will cancel each other out
• (Positive errors cancel negative errors)
• Mean of many observed scores for one person will be the person’s true score
Reliability – the basics
• Example: we want to measure Sarah’s ability to spell English words.
• We can’t ask her to spell every word in the OED, so…
• Ask Sarah to spell a subset of English words
• % correct estimates her true English spelling skill
• But which words should be in our subset?
Estimating Sarah’s spelling ability…
• Suppose we choose 20 words randomly…
• What if, by chance, we get a lot of very easy words – cat, tree, chair, stand…
• Or, by chance, we get a lot of very difficult words – desiccate, arteriosclerosis, numismatics
Estimating Sarah’s spelling ability…
• Sarah’s observed score varies as the difficulty of the random sets of words varies
• But presumably her true score (her actual spelling ability) remains constant.
Reliability – the basics
• Other things can produce error in our measurement
• E.g., on the first day that we test Sarah, she’s tired
• But on the second day, she’s rested…
• This would lead to different scores on the two days
Estimating Sarah’s spelling ability…
• Conclusion:
O = T + e
(observed score O = true score T + error e)
But e1 ≠ e2 ≠ e3 … across testings
• The variation in Sarah’s scores is produced by measurement error.
• How can we measure such effects – how can we measure reliability?
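To make O = T + e concrete, here is a minimal Python sketch (the true score of 75 and error SD of 5 are hypothetical values, not from the chapter) showing that random errors cancel out across many testings:

```python
import random

def simulate_observed_scores(true_score, error_sd, n_tests, seed=0):
    """Classical test theory: each observed score is O = T + e,
    with e drawn from a zero-mean (random) error distribution."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_tests)]

scores = simulate_observed_scores(true_score=75, error_sd=5, n_tests=10_000)
print(round(sum(scores) / len(scores), 2))  # ≈ 75: positive and negative errors cancel
```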
Reliability – the basics
• In what follows, we consider various sources of error in measurement.
• Different ways of measuring reliability are sensitive to different sources of error.
How do we deal with sources of error?
• Error due to test items → domain sampling error
• Error due to testing occasions → time sampling error
• Error due to testing multiple traits → internal consistency error
Domain Sampling error
• A knowledge base or skill set containing many items is to be tested – e.g., the chemical properties of foods.
• We can’t test the entire set of items, so we select a sample of items. That produces domain sampling error, as in Sarah’s spelling test.
Domain Sampling error
• There is a “domain” of knowledge to be tested
• A person’s score may vary depending upon what is included or excluded from the test.
Domain Sampling error
• Smaller sets of items may not test entire knowledge base.
• Larger sets of items should do a better job of covering the whole knowledge base.
• As a result, the reliability of a test increases with the number of items on that test.
Domain Sampling error
• Parallel Forms Reliability:
• choose 2 different sets of test items.
• these 2 sets give you “parallel forms” of the test
• Across all people tested, if correlation between scores on 2 parallel forms is low, then we probably have domain sampling error.
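As an illustration, parallel-forms reliability is just the Pearson correlation between scores on the two forms, computed over the same group of people. A minimal sketch with made-up scores (statistics.correlation needs Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores for the same eight people on two parallel forms
form_a = [14, 18, 11, 16, 9, 17, 13, 15]
form_b = [15, 17, 12, 15, 10, 18, 12, 16]

r = correlation(form_a, form_b)  # Pearson r = parallel-forms reliability
print(round(r, 3))               # a low r would point to domain sampling error
```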
Time Sampling error
• Test-retest Reliability: the person taking the test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.
• Give same test repeatedly & check correlations among scores
• High correlations indicate stability – less influence of bad or good days.
Time Sampling error
• Test-retest approach is only useful for traits – characteristics that don’t change over time
• Not all low test-retest correlations imply a weak test
• Sometimes, the characteristic being measured varies with time (as in learning)
Time Sampling error
• Interval over which correlation is measured matters
• E.g., for young children, use a very short period (< 1 month, in general)
• In general, interval should not be > 6 months
Time Sampling error
• Test-retest approach advantage: easy to evaluate, using correlation
• Disadvantage: carryover & practice effects
• Carryover: first testing session influences scores on next session
• Practice: when carryover effect involves learning
Internal Consistency error
• Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
• Would you expect much correlation between scores on the two parts? No – because the two ‘skills’ are unrelated.
Internal Consistency Approach
• A low correlation between scores on 2 halves of a test suggests that the test is tapping two different abilities or traits.
• A good test has high correlations between scores on its two halves. But how should we divide the test in two to check that correlation?
Internal Consistency error
• Split-half method
• Kuder-Richardson formula
• Cronbach’s alpha
• All of these assess the extent to which items on a given test measure the same ability or trait.
Split-half Reliability
• After testing, divide test items into halves A & B that are scored separately.
• Check for correlation of results for A with results for B.
• Various ways of dividing test into two – randomly, first half vs. second half, odd-even…
Split-half Reliability – a problem
• Each half-test is smaller than the whole
• Smaller tests have lower reliability (domain sampling error)
• So, we shouldn’t use the raw split-half reliability to assess reliability for the whole test
Split-half reliability – a problem
• We correct the reliability estimate using the Spearman-Brown formula:
r_e = 2r_c / (1 + r_c)
r_e = estimated reliability for the whole test
r_c = computed reliability (correlation between scores on the two halves A and B)
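A minimal sketch of the whole procedure, using hypothetical half-test scores for eight people (an odd–even split is assumed):

```python
from statistics import correlation  # Python 3.10+

def spearman_brown(r_c):
    """Correct a split-half correlation r_c to estimate full-test reliability."""
    return (2 * r_c) / (1 + r_c)

# Hypothetical per-person scores on the two halves of the test
half_a = [8, 6, 9, 5, 7, 4, 9, 6]
half_b = [7, 6, 8, 5, 8, 5, 9, 5]

r_c = correlation(half_a, half_b)     # reliability of a half-length test
print(round(spearman_brown(r_c), 3))  # estimated reliability of the whole test
```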
Kuder-Richardson 20
• Kuder & Richardson (1937): an internal-consistency measure that doesn’t require arbitrary splitting of test into 2 halves.
• KR-20 avoids problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.
Kuder-Richardson 20
• The formula contains two basic terms:
1. a measure of all the variance in the whole set of test results.
2. “item variance” – when items measure the same trait, they co-vary (same people get them right or wrong). More co-variance = less “item variance”.
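Putting the two terms together, KR-20 = (k / (k − 1)) × (1 − Σpq / σ²), where k is the number of items, Σpq is the summed item variance, and σ² is the variance of the total scores. A sketch assuming dichotomous 0/1 item data (the 5 × 4 data matrix below is invented):

```python
from statistics import pvariance

def kr20(item_scores):
    """KR-20 for dichotomous (0/1) items.
    item_scores: one list per person, one 0/1 entry per item."""
    n_people = len(item_scores)
    k = len(item_scores[0])                               # number of items
    var_total = pvariance([sum(p) for p in item_scores])  # variance of total scores
    # "item variance" term: sum of p*q over items
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in item_scores) / n_people
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical results: 5 people × 4 items, 1 = correct, 0 = wrong
data = [[1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]
print(round(kr20(data), 3))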
Internal Consistency – Cronbach’s α
• KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
• Cronbach’s α (alpha) generalizes KR-20 to tests with multiple response categories.
• α is a more generally-useful measure of internal consistency than KR-20
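Since α simply replaces the Σpq term with the summed variances of the individual items, a compact sketch is possible; the Likert-style data below are invented:

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for items with any numeric scoring;
    reduces to KR-20 when every item is scored 0/1."""
    k = len(item_scores[0])                               # number of items
    var_total = pvariance([sum(p) for p in item_scores])  # variance of total scores
    item_vars = [pvariance([p[j] for p in item_scores]) for j in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / var_total)

# Hypothetical 1–5 Likert responses: 4 people × 3 items
data = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 2, 3]]
print(round(cronbach_alpha(data), 3))
```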
Review: How do we deal with sources of error?

Approach         Measures                             Issues
Test-Retest      Stability of scores                  Carryover
Parallel Forms   Equivalence & Stability              Effort
Split-half       Equivalence & Internal consistency   Shortened test
KR-20 & α        Equivalence & Internal consistency   Difficult to calculate
Reliability in Observational Studies
• Some psychologists collect data by observing behavior rather than by testing.
• This approach requires time sampling, leading to sampling error
• Further error due to: observer failures, inter-observer differences
Reliability in Observational Studies
• Deal with possibility of failure in the single-observer situation by having more than 1 observer.
• Deal with inter-observer differences using: inter-rater reliability, the Kappa statistic
Reliability in Observational Studies
• Inter-rater reliability: % agreement between 2 or more observers
• Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
• This means that % agreement may over-estimate inter-rater reliability.
Reliability in Observational Studies
• Kappa Statistic (Cohen, 1960)
• estimates actual inter-rater agreement as a proportion of potential inter-rater agreement after correction for chance.
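A minimal sketch of both quantities – raw % agreement and kappa’s chance-corrected version – using invented two-choice ratings from two observers:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    cats = set(labels_a) | set(labels_b)
    # chance agreement: product of each rater's marginal proportions, summed
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 2-choice ratings from two observers
a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
b = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no"]
print(cohens_kappa(a, b))  # 0.5 – lower than raw % agreement (6/8 = 0.75)
```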
Using Reliability Information
• Standard error of measurement (SEM)
• estimates the extent to which a test score misrepresents a true score.
• SEM = S√(1 – r), where S is the standard deviation of the test scores and r is the test’s reliability
Standard Error of Measurement
• We use SEM to compute a confidence interval for a particular test score.
• The interval is centered on the test score
• We have confidence that the true score falls in this interval
• E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score.
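For example, a sketch of that interval calculation, using a hypothetical test with S = 15, r = 0.91, and an observed score of 110:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = S * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: S = 15, r = 0.91, observed score = 110
s = sem(15, 0.91)  # 15 * sqrt(0.09) = 4.5
print(f"95% CI for the true score: {110 - 1.96 * s:.1f} to {110 + 1.96 * s:.1f}")
# -> 101.2 to 118.8
```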
Standard Error of Measurement
• A simple way to think of the SEM:
• Suppose we gave one student the same test over and over
• Suppose, too, that no learning took place between tests and the student did not memorize questions
• The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.
What to do about low reliability
• Increase the number of items
• To find how many items you need, use the Spearman-Brown formula (see the sketch below)
• Using more items may introduce new sources of error such as fatigue, boredom
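The “how many items” question uses the prophecy form of Spearman-Brown: n = r_desired(1 − r_current) / (r_current(1 − r_desired)), where n is the factor by which the test must be lengthened. A sketch with hypothetical numbers:

```python
def lengthening_factor(r_current, r_desired):
    """Spearman-Brown prophecy: factor n by which to lengthen a test
    to raise its reliability from r_current to r_desired."""
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

# Hypothetical: a 20-item test with r = 0.70; we want r = 0.90
n = lengthening_factor(0.70, 0.90)  # ≈ 3.86
print(round(20 * n))                # ≈ 77 items needed
```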