
1

Chapter 4 – Reliability

1. Observed Scores and True Scores
2. Error
3. How We Deal with Sources of Error:
   A. Domain sampling – test items
   B. Time sampling – test occasions
   C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability

2

Chapter 4 - Reliability

• Measurement of human ability and knowledge is challenging because:
  – ability is not directly observable – we infer ability from behavior
  – all behaviors are influenced by many variables, only a few of which matter to us

3

Observed Scores

O = T + e

O = Observed score

T = True score

e = error

4

Reliability – the basics

1. A true score on a test does not change with repeated testing

2. A true score would be obtained if there were no error of measurement.

3. We assume that errors are random (equally likely to increase or decrease any test result).

5

Reliability – the basics

• Because errors are random, if we test one person many times, the errors will cancel each other out

• (Positive errors cancel negative errors)

• The mean of many observed scores for one person will approximate the person’s true score
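A minimal simulation of the O = T + e idea (the true score, error spread, and number of testings are illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 75                        # T: the person's (unobservable) true score
errors = rng.normal(0, 5, size=1000)   # e: random error, equally likely + or -
observed = true_score + errors         # O = T + e, one value per testing

print(observed.mean())   # close to 75: positive and negative errors cancel
```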

6

Reliability – the basics

• Example: we want to measure Sarah’s spelling ability for English words.

• We can’t ask her to spell every word in the OED, so…

• Ask Sarah to spell a subset of English words

• % correct estimates her true English spelling skill

• But which words should be in our subset?

7

Estimating Sarah’s spelling ability…

• Suppose we choose 20 words randomly…

• What if, by chance, we get a lot of very easy words – cat, tree, chair, stand…

• Or, by chance, we get a lot of very difficult words – desiccate, arteriosclerosis, numismatics

8

Estimating Sarah’s spelling ability…

• Sarah’s observed score varies as the difficulty of the random sets of words varies

• But presumably her true score (her actual spelling ability) remains constant.

9

Reliability – the basics

• Other things can produce error in our measurement

• E.g. on the first day that we test Sarah she’s tired

• But on the second day, she’s rested…

• This would lead to different scores on the two days

10

Estimating Sarah’s spelling ability…

• Conclusion:

O = T + e

But e1 ≠ e2 ≠ e3 …

• The variation in Sarah’s scores is produced by measurement error.

• How can we measure such effects – how can we measure reliability?

11

Reliability – the basics

• In what follows, we consider various sources of error in measurement.

• Different ways of measuring reliability are sensitive to different sources of error.

12

How do we deal with sources of error?

• Error due to test items – Domain sampling error

13

How do we deal with sources of error?

• Error due to test items – Domain sampling error
• Error due to testing occasions – Time sampling error

14

How do we deal with sources of error?

• Error due to test items – Domain sampling error
• Error due to testing occasions – Time sampling error
• Error due to testing multiple traits – Internal consistency error

15

Domain Sampling error

• A knowledge base or skill set containing many items is to be tested. E.g., the chemical properties of foods.

• We can’t test the entire set of items. So we select a sample of items. That produces domain sampling error, as in Sarah’s spelling test.

16

Domain Sampling error

• There is a “domain” of knowledge to be tested

• A person’s score may vary depending upon what is included or excluded from the test.

17

Domain Sampling error

• Smaller sets of items may not test entire knowledge base.

• Larger sets of items should do a better job of covering the whole knowledge base.

• As a result, reliability of a test increases with the number of items on that test

18

Domain Sampling error

• Parallel Forms Reliability:

• choose 2 different sets of test items.

• these 2 sets give you “parallel forms” of the test

• Across all people tested, if correlation between scores on 2 parallel forms is low, then we probably have domain sampling error.

19

Time Sampling error

• Test-retest Reliability
  – Person taking the test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.

• Give same test repeatedly & check correlations among scores

• High correlations indicate stability – less influence of bad or good days.

20

Time Sampling error

• Test-retest approach is only useful for traits – characteristics that don’t change over time

• Not all low test-retest correlations imply a weak test

• Sometimes, the characteristic being measured varies with time (as in learning)

21

Time Sampling error

• Interval over which correlation is measured matters

• E.g., for young children, use a very short period (< 1 month, in general)

• In general, interval should not be > 6 months

• Not all low test-retest correlations imply a weak test

• Sometimes, the characteristic being measured varies with time (as in learning)

22

Time Sampling error

• Test-retest approach advantage: easy to evaluate, using correlation

• Disadvantage: carryover & practice effects

• Carryover: first testing session influences scores on next session

• Practice: when carryover effect involves learning

23

Internal Consistency error

• Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.

• Would you expect much correlation between scores on the two parts? No – because the two ‘skills’ are unrelated.

24

Internal Consistency Approach

• A low correlation between scores on 2 halves of a test, suggests that the test is tapping two different abilities or traits.

• A good test has high correlations between scores on its two halves. But how should we divide the test in two to check that correlation?

25

Internal Consistency error

• Split-half method
• Kuder-Richardson formula
• Cronbach’s alpha

• All of these assess the extent to which items on a given test measure the same ability or trait.

26

Split-half Reliability

• After testing, divide test items into halves A & B that are scored separately.

• Check for correlation of results for A with results for B.

• Various ways of dividing test into two – randomly, first half vs. second half, odd-even…

27

Split-half Reliability – a problem

• Each half-test is smaller than the whole

• Smaller tests have lower reliability (domain sampling error)

• So, we shouldn’t use the raw split-half reliability to assess reliability for the whole test

28

Split-half reliability – a problem

• We correct the reliability estimate using the Spearman-Brown formula:

r_e = 2r_c / (1 + r_c)

r_e = estimated reliability for the whole test

r_c = computed reliability (correlation between scores on the two halves A and B)
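A minimal sketch of an odd–even split-half reliability with the Spearman-Brown correction applied; it assumes a people-by-items score matrix, and the function name is illustrative:

```python
import numpy as np

def split_half_reliability(items):
    """Odd-even split-half reliability with Spearman-Brown correction.
    `items` is an (n_people, n_items) array of item scores."""
    items = np.asarray(items, dtype=float)
    half_a = items[:, 0::2].sum(axis=1)        # scores on odd-numbered items
    half_b = items[:, 1::2].sum(axis=1)        # scores on even-numbered items
    r_c = np.corrcoef(half_a, half_b)[0, 1]    # computed (half-test) reliability
    r_e = (2 * r_c) / (1 + r_c)                # Spearman-Brown corrected estimate
    return r_e
```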

29

Kuder-Richardson 20

• Kuder & Richardson (1937): an internal-consistency measure that doesn’t require arbitrary splitting of test into 2 halves.

• KR-20 avoids problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.

30

Kuder-Richardson 20

• The formula contains two basic terms:

1. a measure of all the variance in the whole set of test results.

31

Kuder-Richardson 20

• The formula contains two basic terms:

2. “item variance” – when items measure the same trait, they co-vary (same people get them right or wrong). More co-variance = less “item variance”
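For reference, a minimal Python sketch of the standard KR-20 computation for 0/1-scored items, combining the two terms described above (function and variable names are illustrative):

```python
import numpy as np

def kr20(items):
    """KR-20 for dichotomous (0/1) item scores; `items` has shape
    (n_people, n_items)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                    # number of items
    p = items.mean(axis=0)                # proportion passing each item
    q = 1 - p
    sum_item_var = np.sum(p * q)          # "item variance" term
    total_var = items.sum(axis=1).var()   # variance of total test scores
    return (k / (k - 1)) * (1 - sum_item_var / total_var)
```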

32

Internal Consistency – Cronbach’s α

• KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).

• Cronbach’s α (alpha) generalizes KR-20 to tests with multiple response categories.

• α is a more generally useful measure of internal consistency than KR-20
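A minimal sketch of Cronbach’s α, assuming an (n_people × n_items) score matrix where items may have more than two response categories (names are illustrative):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_people, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)
```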

33

Review: How do we deal with sources of error?

Approach         Measures                              Issues
Test-Retest      Stability of scores                   Carryover
Parallel Forms   Equivalence & Stability               Effort
Split-half       Equivalence & Internal consistency    Shortened test
KR-20 & α        Equivalence & Internal consistency    Difficult to calculate

34

Reliability in Observational Studies

• Some psychologists collect data by observing behavior rather than by testing.

• This approach requires time sampling, leading to sampling error

• Further error due to:
  – observer failures
  – inter-observer differences

35

Reliability in Observational Studies

• Deal with possibility of failure in the single-observer situation by having more than 1 observer.

• Deal with inter-observer differences using:
  – Inter-rater reliability
  – Kappa statistic

36

Reliability in Observational Studies

• Inter-rater reliability
  – % agreement between 2 or more observers
  – Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
  – This means that % agreement may over-estimate inter-rater reliability.

37

Reliability in Observational Studies

• Kappa Statistic (Cohen,1960)

• estimates actual inter-rater agreement as a proportion of potential inter-rater agreement after correction for chance.
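A minimal sketch of the kappa computation for two raters, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement (function and variable names are illustrative):

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels."""
    a = np.asarray(rater_a)
    b = np.asarray(rater_b)
    categories = np.union1d(a, b)
    p_o = np.mean(a == b)                          # observed agreement
    p_e = sum(np.mean(a == c) * np.mean(b == c)    # agreement expected by chance
              for c in categories)
    return (p_o - p_e) / (1 - p_e)
```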

38

Using Reliability Information

• Standard error of measurement (SEM)

• estimates extent to which test score misrepresents a true score.

• SEM = S√(1 − r), where S is the standard deviation of the test scores and r is the reliability coefficient

39

Standard Error of Measurement

• We use SEM to compute a confidence interval for a particular test score.

• The interval is centered on the test score

• We have confidence that the true score falls in this interval

• E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score.
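• For example, with hypothetical numbers: if S = 10 and r = 0.84, then SEM = 10√(1 − 0.84) = 4, so a 95% confidence interval around an observed score of 100 is 100 ± (1.96)(4), roughly 92 to 108.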

40

Standard Error of Measurement

• A simple way to think of the SEM:

• Suppose we gave one student the same test over and over

• Suppose, too, that no learning took place between tests and the student did not memorize questions

• The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.

41

What to do about low reliability

• Increase the number of items

• To find how many items you need, use the Spearman-Brown formula (see the sketch below)

• Using more items may introduce new sources of error such as fatigue, boredom
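The general (prophecy) form of the Spearman-Brown formula can be rearranged to estimate how much longer the test must be to reach a target reliability; a minimal sketch with illustrative numbers:

```python
def lengthening_factor(r_observed, r_desired):
    """Spearman-Brown prophecy: factor by which the test must be
    lengthened to reach the desired reliability."""
    return (r_desired * (1 - r_observed)) / (r_observed * (1 - r_desired))

# e.g. a 20-item test with reliability 0.70, targeting 0.90:
# lengthening_factor(0.70, 0.90) ≈ 3.86, so roughly 20 * 3.86 ≈ 77 items
```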

42

What to do about low reliability

• Discriminability analysis

• Find correlations between each item and whole test

• Delete items with low correlations
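A minimal sketch of the item–whole-test correlations described above, assuming a people-by-items score matrix (names are illustrative):

```python
import numpy as np

def item_total_correlations(items):
    """Correlation between each item and the whole-test score.
    `items` is an (n_people, n_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    return np.array([np.corrcoef(items[:, i], total)[0, 1]
                     for i in range(items.shape[1])])

# Items with low correlations are candidates for deletion.
# (A common refinement excludes the item itself from the total before
# correlating - the "corrected" item-total correlation.)
```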