TRANSCRIPT
Reliability
Consistency in testing
Types of variance
• Meaningful variance – variance between test takers that reflects differences in the ability or skill being measured
• Error variance – variance between test takers caused by factors other than differences in the ability or skill being measured
• Test developers as ‘variance chasers’
Sources of error variance
• Measurement error
• Environment
• Administration procedures
• Scoring procedures
• Examinee differences
• Test and items
• Remember, OS = TS + E (observed score = true score + error)
Estimating reliability for NRTs
• Are the test scores reliable over time? Would a student get the same score if tested tomorrow?
• Are the test scores reliable over different forms of the same test?
Would the student get the same score if given a different form of the test?
• Is the test internally consistent?
Reliability coefficient (rxx)
• Range: 0.0 (totally unreliable test) to 1.0 (perfectly reliable test)
• Reliability coefficients are estimates of the systematic variance in the test scores
• Lower reliability coefficient = greater measurement error in the test score
Test-retest reliability
1. Same students take test twice
2. Calculate reliability (Pearson’s r)
3. Interpret r as reliability (conservative)
• Problems
– Logistically difficult
– Learning might take place between tests
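The test-retest procedure above can be sketched in a few lines of Python. The scores here are hypothetical illustration data, not from the lecture; only the Pearson correlation step is from the slide.

```python
# Test-retest reliability: correlate two administrations of the same test.
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    n = len(x)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return cov / (stdev(x) * stdev(y))

time1 = [50, 42, 38, 47, 55, 40]   # hypothetical first administration
time2 = [48, 44, 36, 49, 53, 41]   # hypothetical retest of the same students
r = pearson_r(time1, time2)        # interpreted directly as the reliability estimate
```

Interpreting r itself as the reliability (rather than adjusting it) is what makes this a conservative estimate.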
Equivalent forms reliability
1. Same students take parallel forms of test
2. Calculate correlation
• Problems
– Creating parallel forms can be tricky
– Logistical difficulty
University of Michigan English Placement Test
(University of Michigan English Placement Test Examiner’s Manual)
Internal consistency reliability
• Calculating the reliability from a single administration of a test
• Commonly reported
– Split-half
– Cronbach alpha
– K-R20
– K-R21
• Calculated automatically by many statistical software packages
Split-half reliability
1. The test is split in half (e.g., odd / even) creating “equivalent forms”
2. The two “forms” are correlated with each other
3. The correlation coefficient is adjusted to reflect the entire test length
– Spearman-Brown Prophecy formula
Calculating split half reliability
ID  Q1  Q2  Q3  Q4  Q5  Q6  Odd  Even
1   1   0   0   1   1   0   2    1
2   1   1   0   1   0   1   1    3
3   1   1   1   1   1   0   3    2
4   1   0   0   0   1   0   2    0
5   1   1   1   1   0   0   2    2
6   0   0   0   0   1   0   1    0

Odd:  Mean 1.83, SD 0.75
Even: Mean 1.33, SD 1.21
Calculating split half reliability (2)
Odd  Mean  Diff   Even  Mean  Diff   Prod.
2    1.83  0.17   1     1.33  -0.33  -0.056
1    1.83  -0.83  3     1.33  1.67   -1.386
3    1.83  1.17   2     1.33  0.67   0.784
2    1.83  0.17   0     1.33  -1.33  -0.226
2    1.83  0.17   2     1.33  0.67   0.114
1    1.83  -0.83  0     1.33  -1.33  1.104

Sum of products = 0.334
Calculating split half
r = 0.334 / ((6)(0.75)(1.21)) = 0.06
Adjust for test length using the Spearman-Brown Prophecy formula:
r_xx = (2 × 0.06) / ((2 − 1) × 0.06 + 1) = 0.11
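The whole worked example can be reproduced in Python using the odd/even scores from the table above. The calculation follows the slide's recipe (sum of cross-products divided by n × SD_odd × SD_even, then the Spearman-Brown adjustment):

```python
# Split-half reliability for the six students in the slide example.
from statistics import mean, stdev

odd  = [2, 1, 3, 2, 2, 1]   # odd-item half scores (Q1 + Q3 + Q5)
even = [1, 3, 2, 0, 2, 0]   # even-item half scores (Q2 + Q4 + Q6)

n = len(odd)
cross = sum((o - mean(odd)) * (e - mean(even)) for o, e in zip(odd, even))
r_half = cross / (n * stdev(odd) * stdev(even))   # correlation of the two halves, ~0.06

# Spearman-Brown Prophecy formula: project from half-length to full-length test
r_full = (2 * r_half) / (1 + r_half)              # ~0.11
```

The adjustment is needed because each "form" is only half as long as the real test, and shorter tests are less reliable.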
Cronbach alpha
• Similar to split half but easier to calculate
α = 2 × (1 − (S²_odd + S²_even) / S²_total)
  = 2 × (1 − ((0.75)² + (1.21)²) / (1.47)²) = 0.12
K-R20
• “Rolls-Royce” of internal reliability estimates
• Simulates calculating split-half reliability for every possible combination of items
K-R20 formula
K-R20 = (k / (k − 1)) × (1 − ΣS²_i / S²_t)
Note that S² is variance, not standard deviation
Sum of item variance (ΣS²_i) = the sum of IF(1 − IF)
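Applying K-R20 to the six-item dataset from the split-half example gives a sketch like the following. Because IF(1 − IF) is a population variance, I use population variance for the total scores as well for consistency; that choice, and the resulting value, are my working assumptions rather than numbers from the slide:

```python
# K-R20 from an item-response matrix (rows = students, columns = Q1..Q6).
from statistics import pvariance

items = [
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
k = len(items[0])                       # number of items
totals = [sum(row) for row in items]    # total score per student

# Sum of item variances: IF * (1 - IF), where IF is the item facility
sum_item_var = 0.0
for j in range(k):
    IF = sum(row[j] for row in items) / len(items)
    sum_item_var += IF * (1 - IF)

kr20 = (k / (k - 1)) * (1 - sum_item_var / pvariance(totals))
```

On this tiny, noisy dataset K-R20 comes out low (around 0.4), which is consistent with the weak split-half estimates above.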
K-R21
• Slightly less accurate than KR-20, but can be calculated with just descriptive statistics
• Tends to underestimate reliability
KR-21 formula
K-R21 = (k / (k − 1)) × (1 − M(k − M) / (k × S²))
Note that S² is variance (standard deviation squared)
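Because KR-21 needs only the number of items, the mean, and the standard deviation, it can be computed from a summary report alone. As an illustration, the mean and SD below are back-calculated from the TAP report that follows (M from mean item difficulty × k, S from the SEM and KR20); that derivation is my assumption, not values printed on the slide:

```python
# KR-21 from descriptive statistics only.
k = 40                             # number of items (from the TAP report)
M = 0.597 * k                      # mean total score, derived from mean item difficulty
S = 2.733 / (1 - 0.882) ** 0.5     # SD recovered from SEM = S * sqrt(1 - r) -- assumption

kr21 = (k / (k - 1)) * (1 - (M * (k - M)) / (k * S * S))
```

This lands on the report's KR-21 value of 0.870, slightly below its KR-20 of 0.882, illustrating KR-21's tendency to underestimate.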
Test summary report (TAP)
Number of Items Excluded = 0
Number of Items Analyzed = 40
Mean Item Difficulty = 0.597
Mean Item Discrimination = 0.491
Mean Point Biserial = 0.417
Mean Adj. Point Biserial = 0.369
KR20 (Alpha) = 0.882
KR21 = 0.870
SEM (from KR20) = 2.733
# Potential Problem Items = 9
High Grp Min Score (n=15) = 31.000
Low Grp Max Score (n=14) = 17.000
Split-Half (1st/2nd) Reliability = 0.307 (with Spearman-Brown = 0.470)
Split-Half (Odd/Even) Reliability = 0.865 (with Spearman-Brown = 0.927)
Standard Error of Measurement
If we give a student the same test repeatedly (test-retest), we would expect to see some variation in the scores
50 49 52 50 51 49 48 50
With enough repetition, these scores would form a normal distribution
We would expect the student to score near the center of the distribution the most often
Standard Error of Measurement
• The greater the reliability of the test, the smaller the SEM
• We expect the student to score within one SEM approximately 68% of the time
• If a student has a score of 50 and the SEM is 3, we expect the student to score between 47 ~ 53 approximately 68% of the time on a retest
Interpreting the SEM
For a score of 29, with SEM = 3 (estimated from K-R21):
26 ~ 32 is within 1 SEM
23 ~ 35 is within 2 SEM
20 ~ 38 is within 3 SEM
Calculating the SEM
What is the SEM for a test with a reliability of r=.889 and a standard deviation of 8.124?
SEM = 2.7
What if the same test had a reliability of r = .95?
SEM = 1.8
SEM = S × √(1 − r_xx)
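Both answers above follow directly from the formula SEM = S × √(1 − r_xx):

```python
# Standard error of measurement from the test SD and reliability.
import math

def sem(sd, reliability):
    """SEM = S * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

low_rel  = sem(8.124, 0.889)   # ≈ 2.7
high_rel = sem(8.124, 0.95)    # ≈ 1.8
```

Raising the reliability from .889 to .95 shrinks the 68% band around a score of 50 from 47~53 to roughly 48~52.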
Reliability for performance assessment
Traditional fixed response assessment:
Test-taker → Instrument (test) → Score

Performance assessment (e.g., writing, speaking):
Test-taker → Task → Performance → Rater/judge → Scale → Score
Interrater/Intrarater reliability
1. Calculate the correlation between all combinations of raters
2. Adjust using Spearman-Brown to account for the total number of raters contributing to the score
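The two steps above can be sketched as follows. The general Spearman-Brown form scales the mean pairwise correlation by the number of raters; the rater scores are hypothetical illustration data:

```python
# Interrater reliability via Spearman-Brown adjustment.
from itertools import combinations
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two lists of ratings."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def adjusted_interrater(mean_r, n_raters):
    """Spearman-Brown: reliability of the score averaged over n raters."""
    return (n_raters * mean_r) / (1 + (n_raters - 1) * mean_r)

# Hypothetical essay ratings from three raters for five test-takers
ratings = {
    "rater_a": [4, 3, 5, 2, 4],
    "rater_b": [4, 2, 5, 3, 3],
    "rater_c": [5, 3, 4, 2, 4],
}

# Step 1: correlate all rater pairs, then average
pair_rs = [pearson_r(ratings[a], ratings[b])
           for a, b in combinations(ratings, 2)]

# Step 2: adjust for the total number of raters giving the score
reliability = adjusted_interrater(mean(pair_rs), len(ratings))
```

With a mean pairwise correlation of 0.6, for example, two raters yield an adjusted reliability of 0.75.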