Chapter 4 – Reliability
1. Observed Scores and True Scores
2. Error
3. How We Deal with Sources of Error:
A. Domain sampling – test items
B. Time sampling – test occasions
C. Internal consistency – traits
4. Reliability in Observational Studies
5. Using Reliability Information
6. What To Do about Low Reliability
Chapter 4 - Reliability
• Measurement of human ability and knowledge is challenging because:
ability is not directly observable – we infer ability from behavior
all behaviors are influenced by many variables, only a few of which matter to us
Reliability – the basics
1. A true score on a test does not change with repeated testing
2. A true score would be obtained if there were no error of measurement.
3. We assume that errors are random (equally likely to increase or decrease any test result).
Reliability – the basics
• Because errors are random, if we test one person many times, the errors will cancel each other out
• (Positive errors cancel negative errors)
• Mean of many observed scores for one person will be the person’s true score
Reliability – the basics
• Example: we want to measure Sarah’s ability to spell English words.
• We can’t ask her to spell every word in the OED, so…
• Ask Sarah to spell a subset of English words
• % correct estimates her true English spelling skill
• But which words should be in our subset?
Estimating Sarah’s spelling ability…
• Suppose we choose 20 words randomly…
• What if, by chance, we get a lot of very easy words – cat, tree, chair, stand…
• Or, by chance, we get a lot of very difficult words – desiccate, arteriosclerosis, numismatics
Estimating Sarah’s spelling ability…
• Sarah’s observed score varies as the difficulty of the random sets of words varies
• But presumably her true score (her actual spelling ability) remains constant.
Reliability – the basics
• Other things can produce error in our measurement
• E.g., on the first day that we test Sarah, she’s tired
• But on the second day, she’s rested…
• This would lead to different scores on the two days
Estimating Sarah’s spelling ability…
• Conclusion:
O = T + e
(observed score O = true score T + error e)
But e1 ≠ e2 ≠ e3 … across testings
• The variation in Sarah’s scores is produced by measurement error.
• How can we measure such effects – how can we measure reliability?
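To make O = T + e concrete, here is a minimal Python sketch (the true score of 75 and error SD of 5 are hypothetical values, not from the chapter) showing that random errors cancel out across many testings:

```python
import random

def simulate_observed_scores(true_score, error_sd, n_tests, seed=0):
    """Classical test theory: each observed score is O = T + e,
    with e drawn from a zero-mean (random) error distribution."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_tests)]

scores = simulate_observed_scores(true_score=75, error_sd=5, n_tests=10_000)
print(round(sum(scores) / len(scores), 2))  # ≈ 75: positive and negative errors cancel
```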
Reliability – the basics
• In what follows, we consider various sources of error in measurement.
• Different ways of measuring reliability are sensitive to different sources of error.
How do we deal with sources of error?
• Error due to test items → domain sampling error
• Error due to testing occasions → time sampling error
• Error due to testing multiple traits → internal consistency error
Domain Sampling error
• A knowledge base or skill set containing many items is to be tested – e.g., the chemical properties of foods.
• We can’t test the entire set of items, so we select a sample of items. That produces domain sampling error, as in Sarah’s spelling test.
Domain Sampling error
• There is a “domain” of knowledge to be tested
• A person’s score may vary depending upon what is included or excluded from the test.
Domain Sampling error
• Smaller sets of items may not test entire knowledge base.
• Larger sets of items should do a better job of covering the whole knowledge base.
• As a result, the reliability of a test increases with the number of items on that test.
Domain Sampling error
• Parallel Forms Reliability:
• choose 2 different sets of test items.
• these 2 sets give you “parallel forms” of the test
• Across all people tested, if correlation between scores on 2 parallel forms is low, then we probably have domain sampling error.
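As an illustration, parallel-forms reliability is just the Pearson correlation between scores on the two forms, computed over the same group of people. A minimal sketch with made-up scores (statistics.correlation needs Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores for the same eight people on two parallel forms
form_a = [14, 18, 11, 16, 9, 17, 13, 15]
form_b = [15, 17, 12, 15, 10, 18, 12, 16]

r = correlation(form_a, form_b)  # Pearson r = parallel-forms reliability
print(round(r, 3))               # a low r would point to domain sampling error
```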
Time Sampling error
• Test-retest Reliability: the person taking the test might be having a very good or very bad day – due to fatigue, emotional state, preparedness, etc.
• Give same test repeatedly & check correlations among scores
• High correlations indicate stability – less influence of bad or good days.
Time Sampling error
• Test-retest approach is only useful for traits – characteristics that don’t change over time
• Not all low test-retest correlations imply a weak test
• Sometimes, the characteristic being measured varies with time (as in learning)
Time Sampling error
• Interval over which correlation is measured matters
• E.g., for young children, use a very short period (< 1 month, in general)
• In general, interval should not be > 6 months
Time Sampling error
• Test-retest approach advantage: easy to evaluate, using correlation
• Disadvantage: carryover & practice effects
• Carryover: first testing session influences scores on next session
• Practice: when carryover effect involves learning
Internal Consistency error
• Suppose a test includes both items on social psychology and items requiring mental rotation of abstract visual shapes.
• Would you expect much correlation between scores on the two parts? No – because the two ‘skills’ are unrelated.
Internal Consistency Approach
• A low correlation between scores on 2 halves of a test suggests that the test is tapping two different abilities or traits.
• A good test has high correlations between scores on its two halves. But how should we divide the test in two to check that correlation?
Internal Consistency error
• Split-half method
• Kuder-Richardson formula
• Cronbach’s alpha
• All of these assess the extent to which items on a given test measure the same ability or trait.
Split-half Reliability
• After testing, divide test items into halves A & B that are scored separately.
• Check for correlation of results for A with results for B.
• Various ways of dividing test into two – randomly, first half vs. second half, odd-even…
Split-half Reliability – a problem
• Each half-test is smaller than the whole
• Smaller tests have lower reliability (domain sampling error)
• So, we shouldn’t use the raw split-half reliability to assess reliability for the whole test
Split-half reliability – a problem
• We correct the reliability estimate using the Spearman-Brown formula:
r_e = 2r_c / (1 + r_c)
r_e = estimated reliability for the whole test
r_c = computed reliability (correlation between scores on the two halves A and B)
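A minimal sketch of the whole procedure, using hypothetical half-test scores for eight people (an odd–even split is assumed):

```python
from statistics import correlation  # Python 3.10+

def spearman_brown(r_c):
    """Correct a split-half correlation r_c to estimate full-test reliability."""
    return (2 * r_c) / (1 + r_c)

# Hypothetical per-person scores on the two halves of the test
half_a = [8, 6, 9, 5, 7, 4, 9, 6]
half_b = [7, 6, 8, 5, 8, 5, 9, 5]

r_c = correlation(half_a, half_b)     # reliability of a half-length test
print(round(spearman_brown(r_c), 3))  # estimated reliability of the whole test
```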
Kuder-Richardson 20
• Kuder & Richardson (1937): an internal-consistency measure that doesn’t require arbitrary splitting of test into 2 halves.
• KR-20 avoids problems associated with splitting by simultaneously considering all possible ways of splitting a test into 2 halves.
Kuder-Richardson 20
• The formula contains two basic terms:
1. a measure of all the variance in the whole set of test results.
2. “item variance” – when items measure the same trait, they co-vary (same people get them right or wrong). More co-variance = less “item variance”.
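Putting the two terms together, KR-20 = (k / (k − 1)) × (1 − Σpq / σ²), where k is the number of items, Σpq is the summed item variance, and σ² is the variance of the total scores. A sketch assuming dichotomous 0/1 item data (the 5 × 4 data matrix below is invented):

```python
from statistics import pvariance

def kr20(item_scores):
    """KR-20 for dichotomous (0/1) items.
    item_scores: one list per person, one 0/1 entry per item."""
    n_people = len(item_scores)
    k = len(item_scores[0])                               # number of items
    var_total = pvariance([sum(p) for p in item_scores])  # variance of total scores
    # "item variance" term: sum of p*q over items
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in item_scores) / n_people
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical results: 5 people × 4 items, 1 = correct, 0 = wrong
data = [[1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]
print(round(kr20(data), 3))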
Internal Consistency – Cronbach’s α
• KR-20 can only be used with test items scored as 1 or 0 (e.g., right or wrong, true or false).
• Cronbach’s α (alpha) generalizes KR-20 to tests with multiple response categories.
• α is a more generally-useful measure of internal consistency than KR-20
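Since α simply replaces the Σpq term with the summed variances of the individual items, a compact sketch is possible; the Likert-style data below are invented:

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for items with any numeric scoring;
    reduces to KR-20 when every item is scored 0/1."""
    k = len(item_scores[0])                               # number of items
    var_total = pvariance([sum(p) for p in item_scores])  # variance of total scores
    item_vars = [pvariance([p[j] for p in item_scores]) for j in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / var_total)

# Hypothetical 1–5 Likert responses: 4 people × 3 items
data = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 2, 3]]
print(round(cronbach_alpha(data), 3))
```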
Review: How do we deal with sources of error?

Approach         Measures                             Issues
Test-Retest      Stability of scores                  Carryover
Parallel Forms   Equivalence & Stability              Effort
Split-half       Equivalence & Internal consistency   Shortened test
KR-20 & α        Equivalence & Internal consistency   Difficult to calculate
Reliability in Observational Studies
• Some psychologists collect data by observing behavior rather than by testing.
• This approach requires time sampling, leading to sampling error
• Further error due to: observer failures, inter-observer differences
Reliability in Observational Studies
• Deal with possibility of failure in the single-observer situation by having more than 1 observer.
• Deal with inter-observer differences using: inter-rater reliability, the Kappa statistic
Reliability in Observational Studies
• Inter-rater reliability: % agreement between 2 or more observers
• Problem: in a 2-choice case, 2 judges have a 50% chance of agreeing even if they guess!
• This means that % agreement may over-estimate inter-rater reliability.
Reliability in Observational Studies
• Kappa Statistic (Cohen, 1960)
• estimates actual inter-rater agreement as a proportion of potential inter-rater agreement after correction for chance.
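A minimal sketch of both quantities – raw % agreement and kappa’s chance-corrected version – using invented two-choice ratings from two observers:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    cats = set(labels_a) | set(labels_b)
    # chance agreement: product of each rater's marginal proportions, summed
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 2-choice ratings from two observers
a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
b = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no"]
print(cohens_kappa(a, b))  # 0.5 – lower than raw % agreement (6/8 = 0.75)
```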
Using Reliability Information
• Standard error of measurement (SEM)
• estimates the extent to which a test score misrepresents a true score.
• SEM = S√(1 – r), where S is the standard deviation of the test scores and r is the test’s reliability
Standard Error of Measurement
• We use SEM to compute a confidence interval for a particular test score.
• The interval is centered on the test score
• We have confidence that the true score falls in this interval
• E.g., 95% of the time the true score will fall within 1.96 SEM either way of the test (observed) score.
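For example, a sketch of that interval calculation, using a hypothetical test with S = 15, r = 0.91, and an observed score of 110:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = S * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: S = 15, r = 0.91, observed score = 110
s = sem(15, 0.91)  # 15 * sqrt(0.09) = 4.5
print(f"95% CI for the true score: {110 - 1.96 * s:.1f} to {110 + 1.96 * s:.1f}")
# -> 101.2 to 118.8
```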
Standard Error of Measurement
• A simple way to think of the SEM:
• Suppose we gave one student the same test over and over
• Suppose, too, that no learning took place between tests and the student did not memorize questions
• The standard deviation of the resulting set of test scores (for this one student) would be the standard error of measurement.
What to do about low reliability
• Increase the number of items
• To find how many items you need, use the Spearman-Brown formula (see the sketch below)
• Using more items may introduce new sources of error such as fatigue, boredom
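The “how many items” question uses the prophecy form of Spearman-Brown: n = r_desired(1 − r_current) / (r_current(1 − r_desired)), where n is the factor by which the test must be lengthened. A sketch with hypothetical numbers:

```python
def lengthening_factor(r_current, r_desired):
    """Spearman-Brown prophecy: factor n by which to lengthen a test
    to raise its reliability from r_current to r_desired."""
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

# Hypothetical: a 20-item test with r = 0.70; we want r = 0.90
n = lengthening_factor(0.70, 0.90)  # ≈ 3.86
print(round(20 * n))                # ≈ 77 items needed
```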