Developing Automated Scoring for Large-scale Assessments of
Three-dimensional Learning
Jay Thomas1, Ellen Holste2, Karen Draney3, Shruti Bathia3, and Charles W. Anderson2
1. ACT, Inc.
2. Michigan State University
3. UC Berkeley, BEAR Center
Based on the NRC report Developing Assessments for the Next Generation Science Standards (Pellegrino et al., 2014)
• Need assessment tasks with multiple components to get at all 3 dimensions (C 2-1)
• Tasks must accurately locate students along a sequence of progressively more complex understanding (C 2-2)
• Traditional selected-response items cannot assess the full breadth and depth of NGSS
• Technology can address some of these problems, particularly scalability and cost
Example of a Carbon TIME Item
Comparing FC (forced choice) vs. CR (constructed response) vs. Both
• Compare spread of data
• Adding CR (or using CR only) increases our confidence that we have classified students correctly
• Since constructing explanations is a practice that we focus on in the learning progression (LP), CR items are required to assess the construct fully
Item Development Pipeline
1. Item development
2. Students respond to items
3. WEW (rubric) development
4. Using the WEW (human scoring) to create a training set
5. Creating machine learning (ML) models
6. Using the ML model (computer scoring)
7. Backcheck coding (human)
8. QWK (quadratic weighted kappa) check for reliability
9. Psychometric analysis (IRT, WLE)
10. Interpretation by the larger research group
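The human-scoring-to-machine-scoring step (a rubric-coded training set feeding an ML model that then scores new responses) can be sketched in miniature. This is an illustrative assumption, not the project's actual model: the tiny training set, the bag-of-words representation, and the nearest-centroid classifier are all stand-ins.

```python
# Sketch: train a text classifier on human-scored (rubric-coded) responses,
# then machine-score a new response. Training data and model are hypothetical.
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term counts for a student response."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(scored_responses):
    """Build one centroid vector per rubric level from human-scored data."""
    centroids = {}
    for text, level in scored_responses:
        centroids.setdefault(level, Counter()).update(vectorize(text))
    return centroids

def score(centroids, text):
    """Assign the rubric level whose centroid is most similar."""
    v = vectorize(text)
    return max(centroids, key=lambda lvl: cosine(centroids[lvl], v))

# Hypothetical human-coded training set (rubric levels 1-3)
training = [
    ("plants get food from soil", 1),
    ("plants use sunlight to grow", 2),
    ("plants convert carbon dioxide and water into glucose", 3),
    ("photosynthesis turns carbon dioxide into glucose using light energy", 3),
]
model = train(training)
print(score(model, "glucose is made from carbon dioxide during photosynthesis"))  # → 3
```

A real pipeline at this scale would use richer features and a trained statistical model, but the shape is the same: human codes in, model out, new responses scored automatically.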
Recursive Feedback Loops for Item Development
• Processes moving towards final interpretation
• Feedback loops that indicate that a question, rubric, or coding potentially has a problem that needs to be addressed
Consequences of using machine scoring
• Item revision and improvement
• Increase in the size of the usable data set to increase the power of statistics
• Increased confidence in the reliability of scoring through back-checking samples and revising models
• Reduced costs by needing fewer human coders
• A model showing that the kinds of assessments envisioned by Pellegrino et al. (2014) for NGSS can be reached at scale with low cost
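The back-checking step compares machine scores against fresh human coding of a sample, summarized by quadratic weighted kappa (QWK), which penalizes disagreements by the squared distance between rubric levels. A minimal sketch (the rating data below is hypothetical):

```python
# Quadratic weighted kappa between human back-check codes and machine scores.
# Ratings are integer rubric levels 0..n_levels-1; sample data is made up.
def qwk(human, machine, n_levels):
    n = len(human)
    # Observed agreement matrix
    O = [[0] * n_levels for _ in range(n_levels)]
    for h, m in zip(human, machine):
        O[h][m] += 1
    # Marginal rating histograms
    hist_h = [human.count(k) for k in range(n_levels)]
    hist_m = [machine.count(k) for k in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = ((i - j) ** 2) / ((n_levels - 1) ** 2)   # quadratic weight
            E = hist_h[i] * hist_m[j] / n                # expected count by chance
            num += w * O[i][j]
            den += w * E
    return 1.0 - num / den

human   = [0, 1, 2, 2, 1, 0, 2, 1]   # hypothetical human back-check codes
machine = [0, 1, 2, 1, 1, 0, 2, 2]   # hypothetical machine scores
print(round(qwk(human, machine, 3), 3))  # → 0.795
```

QWK of 1.0 is perfect agreement; 0 is chance-level. A threshold on QWK (the deck does not state which) would trigger the feedback loop back to rubric or model revision.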
As of March 6, 2019

School Year    Responses Scored
2015-16        175,265
2016-17        532,825
2017-18        693,086
2018-19        227,041
TOTAL          1,628,217
Cost Savings and Scalability
• Labor hours needed to human score all responses @ 100 per hour: 16,282.17 hours
• Labor cost per hour (undergraduate students, including misc. costs): $18 per hour
• Cost to human score all responses: $293,079
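The cost figures follow directly from the response totals and the stated rates (100 responses per labor hour, $18 per hour):

```python
# Reproducing the slide's scalability arithmetic from its stated figures.
total_responses = 1_628_217
rate_per_hour = 100          # human-scored responses per labor hour (stated)
wage = 18                    # dollars per hour, undergraduate coders (stated)

hours = total_responses / rate_per_hour
cost = hours * wage
print(f"{hours:,.2f} hours, ${cost:,.0f}")  # → 16,282.17 hours, $293,079
```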
Types of validity evidence
• As taken from the Standards for Educational and Psychological Testing (2014 ed.)
• Evidence based on test content
• Evidence based on response processes
• Evidence based on internal structure
• Evidence based on relations to other variables
  • Convergent and discriminant evidence
  • Test-criterion evidence
• Evidence for validity and consequences of testing
Comparison of interviews and IRT analysis results
• Overall Spearman rank correlation = 0.81, p<0.01, n=49
• Comparison of scoring for one written versus interview item
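A Spearman rank correlation of this kind compares the ordering of students under two measures (here, interview-based and IRT-based placements). A from-scratch sketch with average ranks for ties; the level assignments below are made up for illustration, not the study's data:

```python
# Spearman rank correlation: Pearson correlation of the rank-transformed data.
def ranks(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

interview = [1, 2, 2, 3, 4, 4, 3, 1]   # hypothetical LP levels from interviews
irt       = [1, 2, 3, 3, 4, 3, 4, 1]   # hypothetical levels from IRT scores
print(round(spearman(interview, irt), 2))  # → 0.84
```

The reported 0.81 (n=49) indicates strong agreement between the two placements.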
Evidence based on internal structure
• Analysis method: item response models (specifically, unidimensional and multidimensional partial credit models)
• Provide item and step difficulties and person proficiencies on one scale
• Provide comparisons of step difficulties within items
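Under the partial credit model, the probability of reaching score category k depends on the cumulative sum of (person proficiency minus step difficulty) over the steps up to k. A minimal sketch with made-up values (theta and the step difficulties are illustrative, not the study's estimates):

```python
# Partial credit model (PCM) category probabilities for one item.
import math

def pcm_probs(theta, deltas):
    """P(score = k) for k = 0..len(deltas), given proficiency theta
    and step difficulties deltas."""
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum.
    logits = [0.0]
    for d in deltas:
        logits.append(logits[-1] + (theta - d))
    z = [math.exp(v) for v in logits]
    total = sum(z)
    return [v / total for v in z]

# Hypothetical item with two steps (three score categories)
probs = pcm_probs(theta=0.5, deltas=[-1.0, 1.2])
print([round(p, 3) for p in probs])  # → [0.13, 0.581, 0.289]
```

Because persons and steps sit on one logit scale, step difficulties can be compared across categories within an item and across items, which is what the analysis above uses.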
Step difficulties for each item (2015-16 data)
Classifying Students into LP Levels: Comparing FC to EX + FC
Classifying Students into LP Levels: Comparing EX to EX + FC
Classifying Classroom Data
95% confidence intervals: Average learning gains for teachers with at least 15 students who had both overall pretests and overall posttests (macroscopic explanations)
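A 95% confidence interval for a teacher's average learning gain (posttest minus pretest per student) can be sketched as mean ± 1.96 × SE. The gain scores below are hypothetical, and the normal-approximation critical value 1.96 is an assumption (a t critical value would be slightly wider at n = 15):

```python
# 95% CI for the mean of per-student learning gains (post - pre), one teacher.
import math

def mean_ci95(gains):
    n = len(gains)
    mean = sum(gains) / n
    var = sum((g - mean) ** 2 for g in gains) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                              # standard error
    return mean - 1.96 * se, mean + 1.96 * se

gains = [0.4, 0.1, 0.7, 0.3, 0.5, 0.2, 0.6, 0.4, 0.3, 0.5,
         0.6, 0.2, 0.4, 0.5, 0.3]          # hypothetical gains, 15 students
lo, hi = mean_ci95(gains)
print(f"[{lo:.2f}, {hi:.2f}]")  # → [0.31, 0.49]
```

An interval entirely above zero, as here, would indicate a reliably positive average gain for that teacher's class.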
Questions?
• Contact info
• Jay Thomas jay.thomas@act.org
• Karen Draney kdraney@berkeley.edu
• Andy Anderson andya@msu.edu
• Ellen Holste holste@msu.edu
• Shruti Bathia shruti_bathia@berkeley.edu