The Impact of Selection of Student Achievement Measurement Instrument
on Teacher Value-added Measures
James L. Woodworth, CREDO, Hoover Institution, Stanford
Wen-Juo Lo, University of Arkansas
Joshua B. McGee, Laura and John Arnold Foundation
Nathan C. Jensen, Northwest Evaluation Association
Presentation Outline
1. Purpose
2. Statistical Noise
a. Why it matters
b. Sources
3. Data
4. Methods
5. Results
Purpose
The purpose of this paper is to present to a statistics lay population the extent to which psychometric properties of student test instruments impact teacher value-added measures.
1.Purpose 2.Statistical Noise 3.Data 4.Methods 5.Results
Question
What is the impact of statistical noise introduced by different test characteristics on the stability and accuracy of value-added models?
Why It Matters
[Figure: 5th- and 6th-grade score distributions plotted against the Below Basic, Basic, Proficient, and Advanced cut points]
Primary Sources of Statistical Noise
1. Test Design
2. Vertical Alignment
3. Student Sample Size
Test Design
Proficiency Tests
• Focused around proficiency point
• Designed to differentiate between proficient and not proficient
• Larger variance in Conditional Standard Errors (CSE)
Growth Tests
• Questions measure across entire ability spectrum
• Designed to differentiate between all points on the distribution
• Smaller variance in CSE
Test Design
Paper and Pencil Tests
• Limit item pool to control length
• Focused around proficiency point
• Large variance in CSE
Computer Adaptive Test
• Larger item pool for question selection
• Focused around student ability point
• Smaller variance in CSE
Test Design
CSE Heteroskedasticity Due to Item Focusing: TAKS Reading Grade 5, 2009
[Figure: CSE by vertical scale score; CSE range: 24–74; weighted average CSE = 38.96]
Vertical Alignment
• Year-to-year alignment can impact the results of VAM
– Units must be equal across test sessions
• Spring–Spring VAMs are most affected
• Fall–Spring VAMs using the same test avoid much of the problem
• Item alignment on computer adaptive tests can also impact the results of VAM
Student Sample Size
• Central Limit Theorem
– A larger student n provides a more stable estimate of teacher VAM.
– Typical single-year student n's are 25, 50, and 100 for elementary and middle school teachers.
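The Central Limit Theorem point can be illustrated with a short simulation (a sketch, not from the paper; the student-level gain SD of 10 points is an assumed, illustrative value):

```python
import numpy as np

rng = np.random.default_rng(0)

student_sd = 10.0   # assumed SD of one student's measured gain (illustrative)
true_gain = 24.0    # controlled growth, in scale-score points

# For each typical class size, estimate the spread of the class-average gain
# across 10,000 simulated classrooms.
sd_of_mean = {}
for n in (25, 50, 100):
    class_means = rng.normal(true_gain, student_sd, size=(10_000, n)).mean(axis=1)
    sd_of_mean[n] = class_means.std()
    print(f"n={n:3d}  SD of class-average gain = {sd_of_mean[n]:.2f}")
```

The spread shrinks roughly as 1/√n, which is why a teacher's VAM estimate stabilizes as student n grows.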
Question
What is the impact of statistical noise introduced by different test characteristics on the stability and accuracy of value-added models?
Data Sets
TAKS – Texas Assessment of Knowledge and Skills: Grade 5 Reading, 2009 population statistics
– Proficiency test
– Vertically aligned scale scores
– Average yearly gain
• 24 vertical scale points at "Met Expectations"
• 34 vertical scale points at "Commended"
– Standard errors: Conditional Standard Errors reported by TEA for each vertical scale score
• CSE range: 24–74
• Weighted average CSE = 38.96
– Highly skewed distribution
– High variance
Data Sets
TAKS – Texas Assessment of Knowledge and Skills: Grade 5 Reading
N: 323,507
μ: 701.49
σ²: 10,048.30
σ: 100.24
Data Sets
MAP – Measures of Academic Progress
– Growth measure
– Computer adaptive test
– Single scale
– Average yearly gain
• 5.06 RIT points
– Standard errors: average standard errors range 2.5–3.5 RIT
– Slightly skewed distribution
– Small variance
Data Sets
MAP – Measures of Academic Progress
N: 2,663,382
μ: 208.35
σ2: 161.82
σ: 12.72
Simulated Data
Because it is impossible to isolate true scores and error in real data, we created simulated data points.
– True scores are known for all data points
– Every data point was given the same growth
• All iterations have the same value-added
• Any deviation from expected growth is a function of measurement error only
Simulated Data
We simulated 10,000 z-scores ~ N(0, 1).
From these we selected nested random samples of n = 100, n = 50, and n = 25.
Statistical Summary, z-Score Samples by n

Statistic        n=100    n=50    n=25
Mean             -.13     -.09    .01
Std. Deviation   .97      .97     1.00
Skewness         -.12     .18     .10
Minimum          -2.34    -1.85   -1.77
Maximum          2.09     2.09    2.09
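The sampling step above can be sketched as follows (assuming NumPy; the seed and the exact nesting mechanics are illustrative, not taken from the paper). "Nested" means each smaller sample is drawn from the next larger one:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 10,000 standard-normal "true" z-scores.
z_pool = rng.standard_normal(10_000)

# Nested random samples: n=50 is drawn from the n=100 sample,
# and n=25 from the n=50 sample.
sample_100 = rng.choice(z_pool, size=100, replace=False)
sample_50 = rng.choice(sample_100, size=50, replace=False)
sample_25 = rng.choice(sample_50, size=25, replace=False)

for name, s in (("n=100", sample_100), ("n=50", sample_50), ("n=25", sample_25)):
    print(f"{name}: mean={s.mean():.2f}  sd={s.std(ddof=1):.2f}  "
          f"min={s.min():.2f}  max={s.max():.2f}")
```

Nesting guarantees that differences across the three sample sizes reflect sample size alone, not different draws of students.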
Data Generation
Pre-scores: P1 = z-score × σ + μ
Post-scores: P2 = P1 + controlled growth

Controlled growth values:
– TAKS = 24 vertical scale points (34 at "Commended")
– MAP = 5.06 RIT points

Simulated Growth = (P2 + (Random2 × CSE)) − (P1 + (Random1 × CSE))
Random1 and Random2 ~ N(0, 1)
CSE = Conditional Standard Errors as reported by TEA and NWEA
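A minimal sketch of this data-generation step (assuming NumPy; the paper applied the CSE reported for each vertical scale score, while this sketch uses a single constant CSE for simplicity, and `simulate_growth` is a hypothetical helper name):

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_growth(z_scores, sigma, mu, controlled_growth, cse):
    """Generate noisy pre/post scores and return simulated growth.

    True scores: P1 = z * sigma + mu, P2 = P1 + controlled_growth.
    Independent N(0,1) error scaled by the CSE is added to both the
    pre- and post-test scores before differencing.
    """
    p1 = z_scores * sigma + mu
    p2 = p1 + controlled_growth
    noise_pre = rng.standard_normal(z_scores.size) * cse
    noise_post = rng.standard_normal(z_scores.size) * cse
    return (p2 + noise_post) - (p1 + noise_pre)

# TAKS-like parameters from the slides; constant CSE is a simplification.
z = rng.standard_normal(100)
growth = simulate_growth(z, sigma=100.24, mu=701.49,
                         controlled_growth=24, cse=38.96)
print(f"mean simulated growth = {growth.mean():.1f}")
```

Note that the true growth cancels out of nothing here: every student's observed gain is the controlled growth plus two independent error draws, so any spread around 24 points is measurement error alone.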
Question
What is the impact of statistical noise introduced by different test characteristics on the stability and accuracy of value-added models?
Monte Carlo Simulation
We ran 1,000 iterations for each simulation, equivalent to the same students taking the test 1,000 times with the same true scores but different draws of measurement error.
Simulated Growth = (P2 + (Random2 × CSE)) − (P1 + (Random1 × CSE))
Random1 and Random2 ~ N(0, 1)
CSE = Conditional Standard Errors as reported by TEA and NWEA
We aggregated values by subgroup to determine average performance for each iteration.
False negative: Simulated Growth < 0.5 × Controlled Growth
False positive: Simulated Growth > 1.5 × Controlled Growth
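The iteration loop and the misclassification tally might be sketched as follows (assuming NumPy and a constant CSE per test; `monte_carlo` is a hypothetical helper, and the MAP CSE of 3.0 is an assumed value from the reported 2.5–3.5 range):

```python
import numpy as np

rng = np.random.default_rng(1)

def monte_carlo(n_students, controlled_growth, cse, iterations=1_000):
    """Fraction of iterations whose class-average simulated growth falls
    below 0.5x controlled growth (false negative) or above 1.5x
    controlled growth (false positive)."""
    false_neg = false_pos = 0
    for _ in range(iterations):
        # The true growth cancels out of the pre/post difference except
        # for the two independent error terms, each scaled by the CSE.
        err = (rng.standard_normal(n_students)
               - rng.standard_normal(n_students)) * cse
        avg_growth = controlled_growth + err.mean()
        if avg_growth < 0.5 * controlled_growth:
            false_neg += 1
        elif avg_growth > 1.5 * controlled_growth:
            false_pos += 1
    return false_neg / iterations, false_pos / iterations

# TAKS-like vs MAP-like growth/error ratios (values from the slides)
for label, g, cse in (("TAKS", 24, 38.96), ("MAP", 5.06, 3.0)):
    for n in (25, 50, 100):
        fn, fp = monte_carlo(n, g, cse)
        print(f"{label} n={n:3d}: false neg {fn:.1%}, false pos {fp:.1%}")
```

Because the CSE is roughly 1.6x the controlled growth for TAKS but well under the controlled growth for MAP, the misclassification rates diverge sharply at small n, matching the pattern in the results tables.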
Monte Carlo Results, n=100

ID                                              % False Neg   % False Pos   % Total Correct
TAKS Actual Distribution                            1.7           2.5           95.8
TAKS Normal Distribution at "Meets" Level            .9           1.8           97.3
TAKS Normal Distribution Avg SE                     1.2           1.8           97.0
TAKS Normal Distribution at "Commended" Level        .8            .2           99.0
TAKS Normal Grade Transition                        1.4           2.1           96.5
MAP Normal                                          0.0           0.0          100.0
MAP Max CSE                                         0.0           0.0          100.0
Monte Carlo Results, n=50

ID                                              % False Neg   % False Pos   % Total Correct
TAKS Actual Distribution                            7.4           9.6           83.0
TAKS Normal Distribution at "Meets" Level           6.6           8.4           85.0
TAKS Normal Distribution Avg SE                     5.7           7.4           86.9
TAKS Normal Distribution at "Commended" Level       4.4           1.7           93.9
TAKS Normal Grade Transition                        6.5           8.1           85.4
MAP Normal                                          0.0           0.0          100.0
MAP Max CSE                                          .7            .6           98.7
Monte Carlo Results, n=25

ID                                              % False Neg   % False Pos   % Total Correct
TAKS Actual Distribution                           16.1          18.4           65.5
TAKS Normal Distribution at "Meets" Level          16.8          18.0           65.2
TAKS Normal Distribution Avg SE                    14.5          16.0           69.5
TAKS Normal Distribution at "Commended" Level      10.2           7.7           82.1
TAKS Normal Grade Transition                       18.6          18.2           63.2
MAP Normal                                           .5            .5           99.0
MAP Max CSE                                         3.0           4.2           92.8
Results
Descriptive Statistics, Simulated VAM by Student Sample Size

                                          Controlled   n=100           n=50            n=25
Test                                      Growth       Avg      SD     Avg      SD     Avg      SD
TAKS Actual Distribution                  24           24.29    6.02   24.26    8.78   24.18    12.28
TAKS Normal Distribution at "Meets"       24           24.08    5.45   24.45    8.37   24.14    12.39
TAKS Normal Distribution Avg SE           24           24.19    5.45   24.61    8.03   24.59    11.47
TAKS Normal Distribution at "Commended"   34           33.85    5.60   34.15    8.12   34.92    11.87
TAKS Normal Grade Transition              24           24.08    5.59   24.24    8.59   24.15    12.85
MAP Normal                                5.06         5.07     .49    5.12     .72    5.12     1.03
MAP Max CSE                               5.06         5.05     .71    5.05     .99    5.08     1.37

("Avg" is the average simulated growth across the 1,000 iterations; "SD" is its standard deviation.)
Percent Misidentified by Student Sample Size

Test                                  n=100   n=50   n=25
TAKS Normal Distribution at "Meets"   2.7     15.0   34.8
MAP Normal                            0.0     0.0    1.0
Conclusions
The Growth/Error ratio is the critical variable in VAM stability.
The student n necessary to achieve a stable VAM is sensitive to the Growth/Error ratio.
Stable VAMs are possible even with typical classroom n’s; however, careful attention must be paid to the suitability of the student assessment instrument.
Limitations
No Differentiation between Student Effects, Teacher Effects, or School Effects
No Environmental Effects
No Interaction Terms
These are all areas for additional research.