Brent Duckor Ph.D. (SJSU) Kip Tellez, Ph.D. (UCSC)
BEAR Seminar April 22, 2014
Studies under review
ELA event
Duckor, B., Castellano, K., Téllez, K., & Wilson, M. (2013, April). Validating the internal structure of the Performance Assessment for California Teachers (PACT): A multi-dimensional item response model study. Paper presented at the annual meeting of the American Educational Research Association conference, San Francisco, California.
Mathematics event
Castellano, K., Duckor, B., Téllez, K., Wihardini, D., & Wilson, M. (2013, April). Validity evidence for the internal structure of the Performance Assessment for California Teachers (PACT): Examining the elementary mathematics teaching event. Paper presented at the annual meeting of the National Council on Measurement in Education conference, San Francisco, California.
Becoming a CA teacher
• CA license requirements (“Pre-service”)
  • Subject competency tests (CSET)
  • Coursework
  • TPA (PACT)
• CA license requirements (“Preliminary” In-service)
  • BTSA Induction program
• “Clear” Credential & Full License
For teacher candidates
• The PACT is a standardized licensure “exam” constructed in the field over several weeks
• It comprises multiple constructed-response tasks
• It is “high stakes” in the sense that it has consequences
Instrument Content Primer
The Teaching Event is a constructed, extended-response item design
• It includes written responses to task prompts
• It also includes video clips (e.g., two 10-minute clips)
• And artifacts such as lesson plans, instructional materials, examples of student work, assessments, etc.
Validity
Validity: Working Definition
“The degree to which evidence and theory support the interpretation of test scores entailed by proposed
uses of tests."
(AERA, APA, & NCME, 1999)
Translated into plain English
“The degree to which the written and video evidence provided by a teacher candidate supports the interpretation of that individual’s scores (on 5 domains and 12 tasks) to determine if that individual will enter the California public school classroom.”
(AERA, APA, & NCME, 1999)
Content Validity is not enough
“Teacher educators who participated in the development and design of the assessments were asked to judge the extent to which the content of the Teaching Events was an authentic representation of important dimensions of teaching. Another study examined the alignment of the TE tasks to the California Teaching Performance Expectations (TPEs). Overall, the findings across all content validity activities suggest a strong linkage between the TPE standards, the TE tasks and the skills and abilities that are needed for safe and competent professional practice.” (Technical Manual, 2007, pp. 25-27)[Emphasis added]
Validity Evidence
Content
Response processes
Internal structure
Relations to external variables
Consequences
As Kane notes (1994)
• The plausibility of an interpretation depends on evidence supporting the proposed interpretation and refuting competing interpretations.
• Moreover, we should expect that different types of validity evidence will be relevant to different parts of the argument.
• Claims that the situations included in licensure examinations are representative of the situations encountered in some area of practice could be supported by expert judgment or by empirical data.
Validation to support responsible use
• Content Validity: Do the developers demonstrate coverage of the curriculum with a sufficient number of tasks for each topic area to ensure the meaningfulness of the score results?
• Response Processes Validity: Did the developers interview candidates to check how time, energy, motivation, confusion, language facility, writing ability, test-wiseness, etc. may have reduced/overstated the meaningfulness of the score results?
• Internal Structure Validity: Does a statistical analysis of the “dimensionality” of the teacher candidates’ results indicate that we are measuring what we intend to measure?
• Relations to Other External Variables Validity: Do the PACT score results correlate with other scores, e.g., field placement scores from university supervisors and cooperating teachers?
• Consequential Validity: Does the PACT event lead to “teaching to the test”, eliminate other “non-tested” content from the teacher education curriculum, or otherwise harm the novice teacher learning experience?
Our study: Focus on Internal Structure Validity Claims
• Objective: Examine evidence for internal structure of the Elementary Literacy (EL) PACT scores
• Theoretical framework: Validation study to address two research questions:
• To what extent does an IRT model fit the EL PACT instrument and aid in describing teacher candidate “ability” and task “difficulty” across the State?
• Is there evidence that the EL PACT assesses multiple constructs other than intended?
• Methods & Data Analyses
• Sampling: 2008-2010 (n = 1,711)
• Scoring & Data (production, masked, no student ID)
• Statistical procedures: Partial credit model (IRT) for unidimensional and multidimensional investigation
• Results
Summary Stats
Table 1. Summary Statistics by Item
Items by domain: Planning (P1-P3), Instruction (I4-I5), Assessment (A6-A8), Reflection (R9-R10), Academic Language (AL11-AL12)

Time       Statistic  P1    P2    P3    I4    I5    A6    A7    A8    R9    R10   AL11  AL12
2008-2009  N          407   407   406   407   407   407   407   311   407   407   407   407
           Mean       2.74  2.74  2.59  2.54  2.45  2.63  2.44  2.54  2.55  2.54  2.18  2.52
           SD         0.72  0.76  0.71  0.74  0.74  0.78  0.76  0.81  0.71  0.78  0.71  0.67
2009-2010  N          1304  1303  1304  1304  1304  1304  1304  1297  1303  1304  1304  1303
           Mean       2.83  2.82  2.73  2.56  2.43  2.61  2.47  2.56  2.50  2.51  2.27  2.46
           SD         0.65  0.70  0.68  0.71  0.71  0.75  0.75  0.78  0.71  0.74  0.67  0.62
Overall    N          1711  1710  1710  1711  1711  1711  1711  1608  1710  1711  1711  1710
           Mean       2.81  2.80  2.70  2.56  2.43  2.61  2.46  2.56  2.51  2.52  2.25  2.48
           SD         0.66  0.71  0.69  0.71  0.72  0.76  0.75  0.78  0.71  0.75  0.68  0.63
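As a quick consistency check on Table 1, the "Overall" mean for an item should be the N-weighted average of the two yearly means. The sketch below does this for item P1 using only the values reported in the table; the function name `pooled_mean` is an illustrative choice, not something from the study.

```python
def pooled_mean(groups):
    """N-weighted mean across groups given as (n, mean) pairs."""
    total_n = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total_n

# Item P1 from Table 1: (N, Mean) for 2008-2009 and 2009-2010.
p1_by_year = [(407, 2.74), (1304, 2.83)]
p1_overall = pooled_mean(p1_by_year)

# Rounds to 2.81, matching the "Overall" row for P1.
print(round(p1_overall, 2))
```

The same check can be repeated for any other item column; small discrepancies would be expected only from rounding of the reported yearly means.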
Constructs: Readiness to teach in 5 Domains
Items Design: 12 Tasks
Outcome Space: 12 Rubrics
Reliability
Validity
Measurement Model: IRT scale scores
Findings
• Unidimensional Partial Credit Model
• Partial Credit Model fits better than the Rating Scale Model
• The step difficulty parameters band together
• The Planning items tend to be the easiest and the Academic Language the most difficult
IRT and Rasch Models
• In the Rasch model, the probability of a specified response is modeled as a function of person and item parameters
• Person (theta) parameters = teacher candidates’ “proficiency”
• Item (delta) parameters= “task difficulties”
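For reference, the dichotomous Rasch model and its partial credit extension are commonly written as below. The notation is the conventional one and is an assumption here, not taken from the slides: θ_p is the proficiency of candidate p, δ_i the difficulty of task i, δ_ik the difficulty of step k of task i, and m_i the number of steps, with the empty sum (x = 0) defined as zero.

```latex
% Dichotomous Rasch model
P(X_{pi} = 1 \mid \theta_p, \delta_i)
  = \frac{\exp(\theta_p - \delta_i)}{1 + \exp(\theta_p - \delta_i)}

% Partial credit model for task i with steps k = 1, \dots, m_i
P(X_{pi} = x \mid \theta_p)
  = \frac{\exp\left(\sum_{k=1}^{x} (\theta_p - \delta_{ik})\right)}
         {\sum_{h=0}^{m_i} \exp\left(\sum_{k=1}^{h} (\theta_p - \delta_{ik})\right)},
  \qquad x = 0, 1, \dots, m_i
```

When θ_p equals δ_i in the first expression, the probability is exactly one half, which is the property the Wright map relies on.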
Figure 12. Representation of possible relationships between respondent and item location, panels (a), (b), and (c) (adapted from Wilson, 2005).
As shown in case (a) of Figure 12, if a respondent’s location on the person (right-hand) side of the Wright map is above the item location on the left-hand side, he or she is more than 50% likely to make that response. In this case, the probability governing a correct response suggests that the items below him or her are relatively “easier” because he or she has more of the construct, that is, proficiency for the given dimension.
As shown in case (c) of Figure 12, if a respondent’s location on the right-hand side of the Wright map is below the item location on the left-hand side, then he or she is less than 50% likely to make that response. In this case, the probability governing a correct response suggests that the items above him or her are relatively “harder” because he or she has less of the construct, that is, proficiency for the given dimension.
As shown in case (b) of Figure 12, if a respondent’s location on the right-hand side of the Wright map is co-equal with an item location on the left-hand side, then he or she is 50% likely to make that response.
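The three cases described for Figure 12 can be illustrated numerically. This is a minimal sketch assuming the dichotomous Rasch form; the locations (in logits) are hypothetical values chosen only to place the respondent above, at, and below the item.

```python
import math

def rasch_probability(theta: float, delta: float) -> float:
    """Probability of the modeled response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# Hypothetical item location and respondent locations, in logits.
delta = 0.0
cases = {"a": 1.0, "b": 0.0, "c": -1.0}  # above, level with, below the item

probabilities = {label: rasch_probability(theta, delta) for label, theta in cases.items()}
for label, p in probabilities.items():
    print(f"case ({label}): P = {p:.2f}")
```

Only the difference between respondent and item location matters, so shifting both by the same amount leaves every probability unchanged.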
Person Ability | Item Score Thresholds
             |
 6           |
            X|
            X|
             |
 5          X|
            X|AL12.4
            X|AL11a.4 AL11b.4
            X|
 4         XX|P3a.4 I5.4 R9.4
           XX|I4.4 A7.4
           XX|P3b.4 A8.4 R10.4
          XXX|P1.4 A6.4
 3        XXX|P2.4
         XXXX|
       XXXXXX|
 2     XXXXXX|AL11a.3
       XXXXXX|
       XXXXXX|AL11b.3
      XXXXXXX|
 1   XXXXXXXX|
   XXXXXXXXXX|I5.3 A7.3 R10.3
     XXXXXXXX|R9.3 AL12.3
    XXXXXXXXX|I4.3 A6.3 A8.3
 0   XXXXXXXX|P3a.3
    XXXXXXXXX|
    XXXXXXXXX|P2.3 P3b.3
-1  XXXXXXXXX|
    XXXXXXXXX|P1.3
     XXXXXXXX|
        XXXXX|
-2      XXXXX|
        XXXXX|
          XXX|
           XX|AL11a.2
-3         XX|
            X|A8.2 AL11b.2
            X|A7.2
            X|I5.2
-4          X|I4.2 R9.2 R10.2
            X|P3a.2 A6.2
             |AL12.2
-5           |P3b.2
             |P1.2
             |P2.2
             |
-6           |
Threshold for Score Level 4 (Ability level at which examinee has 50 percent chance of obtaining a score of 4)
Threshold for Score Level 3 (Ability level at which examinee has 50 percent chance of obtaining a score of 3 or higher)
Threshold for Score Level 2 (Ability level at which examinee has 50 percent chance of obtaining a score of 2 or higher)
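The threshold definition above (the ability at which a candidate has a 50 percent chance of scoring at a given level or higher) can be computed from partial credit model step difficulties. The sketch below is one way to do that, assuming a 4-level rubric and hypothetical step difficulties; the function names and the bisection search are illustrative choices, not the procedure used in the study.

```python
import math

def pcm_category_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the partial credit model."""
    # Running sums of (theta - delta_k); the lowest category has an empty sum of 0.
    exponents = [0.0]
    for delta in step_difficulties:
        exponents.append(exponents[-1] + (theta - delta))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def cumulative_threshold(step_difficulties, category, lo=-10.0, hi=10.0, tol=1e-8):
    """Ability at which P(score in `category` or higher) = 0.5, found by bisection."""
    def prob_at_or_above(theta):
        return sum(pcm_category_probabilities(theta, step_difficulties)[category:])
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob_at_or_above(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical step difficulties (logits) for a rubric scored 1-4.
# Category indices 1, 2, 3 correspond to score levels 2, 3, 4.
steps = [-4.0, 0.0, 4.0]
thresholds = [cumulative_threshold(steps, k) for k in (1, 2, 3)]
for level, value in zip((2, 3, 4), thresholds):
    print(f"Threshold for score level {level}: {value:.2f}")
```

Bisection works here because the probability of scoring at or above a category increases with ability under this model, so each threshold is unique within the search interval.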
Findings
• Multidimensional Partial Credit Model
• 4D (task based) & 5D (domain based) models fit less well (AIC) than 3D model
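The model comparison above relies on AIC, which adds a penalty of two per estimated parameter to the deviance (minus twice the log-likelihood), with lower values preferred. The sketch below shows the mechanics only: every deviance and parameter count in it is a made-up toy number, not a result from the PACT analysis.

```python
def aic(deviance: float, n_parameters: int) -> float:
    """Akaike information criterion from the deviance (-2 log-likelihood)."""
    return deviance + 2 * n_parameters

# Toy values for illustration only; these are NOT the study's estimates.
candidates = {
    "1D": aic(deviance=1000.0, n_parameters=10),
    "3D": aic(deviance=970.0, n_parameters=15),
    "4D": aic(deviance=968.0, n_parameters=19),
    "5D": aic(deviance=965.0, n_parameters=24),
}

best = min(candidates, key=candidates.get)
print(best, candidates[best])
```

In this toy setup the higher-dimensional models reduce the deviance slightly but pay more in penalty, which is the kind of pattern that would lead a 3D model to be preferred over 4D and 5D alternatives.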
Task-based model
[Figure: task-based grouping of items P1-P3, I4-I5, A6-A8, R9-R10, AL11-AL12 under four dimensions: Tasks 1 & 2, Task 3, Task 4, Task 5. Shown alongside the domain-based and three-dimension groupings as panels (a), (b), (c).]
Domain based model
[Figure: domain-based grouping of the 12 items under five dimensions: Planning (P1-P3), Instruction (I4-I5), Assessment (A6-A8), Reflection (R9-R10), Academic Language (AL11-AL12).]
Findings
• Additionally, pairwise correlations are not as high as for modified 3D model
Correlations
Table 2. Observed Correlations between Mean Domain Scores (above diagonal) and Disattenuated Correlations between Domains/Dimensions (below diagonal)

                 Planning  Instruction  Assessment  Reflection  Academic Lang
Planning         -         0.64         0.64        0.63        0.65
Instruction      0.81      -            0.61        0.62        0.61
Assessment       0.80      0.79         -           0.70        0.67
Reflection       0.82      0.84         0.92        -           0.67
Academic Lang    0.84      0.84         0.90        0.95        -
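To see how the two halves of Table 2 relate, the classical correction for attenuation divides an observed correlation by the square root of the product of the two reliabilities. The slides do not say how the disattenuated values were obtained (in a multidimensional IRT analysis they are often estimated directly as latent correlations), so the sketch below is only the classical analogue, and the function names are illustrative.

```python
import math

def disattenuate(observed_r, reliability_x, reliability_y):
    """Classical correction for attenuation."""
    return observed_r / math.sqrt(reliability_x * reliability_y)

def implied_joint_reliability(observed_r, disattenuated_r):
    """sqrt(rel_x * rel_y) implied by an observed/disattenuated pair."""
    return observed_r / disattenuated_r

# Planning-Instruction pair from Table 2: observed 0.64, disattenuated 0.81.
observed, corrected = 0.64, 0.81
joint = implied_joint_reliability(observed, corrected)
print(f"implied sqrt(rel_x * rel_y) = {joint:.2f}")  # about 0.79

# Feeding that value back in recovers the disattenuated correlation.
print(f"{disattenuate(observed, joint, joint):.2f}")
```

Read this way, the gap between the upper and lower triangles reflects how much measurement error in the short domain scores pulls the observed correlations down.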
Findings
• There is evidence for “Planning” and “Instructing” and “Meta-Reflecting” proficiencies fitting a 3D model
Dimension based model
[Figure: three-dimension grouping of the 12 items: Planning (P1-P3); Instruction (I4-I5); Assessment, Reflection & Academic Language (A6-A8, R9-R10, AL11-AL12).]
Panels, left to right: Dimension 1 - Planning | Dimension 2 - Instruction | Dimension 3 - Assessment, Reflection, Academic Language
Each panel: Person Ability | Item Score Threshold

 9           |                     |                     |
             |                     |                     |
           X |                     |                     |
 8           |                     |                     |
           X |                     |                     |
 7           |                   X |                     |
           X |                     |                     |
           X |                   X |                   X |
 6         X |                   X |                   X |
           X |                     |                   X |
           X |                   X |                     |
 5        XX | P3.4             XX |                   X | AL12.4
          XX | P1.4              X | I4.4,I5.4         X | A6.4,AL11a.4
 4        XX |                 XXX |                   X | R9.4
         XXX | P2.4             XX |                  XX | A8a.4
         XXX |                  XX |                 XXX | A8b.4,R10.4
 3       XXX |                 XXX |                 XXX | A7.4
         XXX |                XXXX |                XXXX |
    XXXXXXXX |             XXXXXXX |                XXXX |
 2   XXXXXXX |              XXXXXX |            XXXXXXXX | AL11a.3
   XXXXXXXXX |             XXXXXXX |           XXXXXXXXX | AL11b.3
 1  XXXXXXXX |               XXXXX |               XXXXX |
     XXXXXXX |             XXXXXXX |           XXXXXXXXX | A6.3,A8a.3
      XXXXXX |           XXXXXXXXX |             XXXXXXX | A7.3,A8b.3,R9.3,R10.3,AL12.3
 0     XXXXX | P3.3        XXXXXXX | I5.3        XXXXXXX |
      XXXXXX |             XXXXXXX |            XXXXXXXX |
     XXXXXXX | P2.3        XXXXXXX | I4.3      XXXXXXXXX |
-1     XXXXX | P1.3        XXXXXXX |           XXXXXXXXX |
      XXXXXX |               XXXXX |            XXXXXXXX |
-2      XXXX |              XXXXXX |            XXXXXXXX |
        XXXX |           XXXXXXXXX |                XXXX |
     XXXXXXX |                 XXX |                 XXX |
-3     XXXXX |                  XX |                   X | AL11a.2
           X |                 XXX |                   X | A8b.2,AL11b.2
           X |                     |                   X | A8a.2
-4         X |                   X |                   X | R9.2,R10.2
           X |                   X |                   X | A6.2,A7.2
-5           |                   X | I5.2                | AL12.2
           X |                     |                     |
             | P3.2                |                     |
-6           |                   X | I4.2                |
             | P1.2,P2.2           |                     |
             |                     |                     |
-7           |                     |                     |
Threshold for Score Level 4 (Ability level at which examinee has 50 percent chance of obtaining a score of 4 or higher)
Threshold for Score Level 3 (Ability level at which examinee has 50 percent chance of obtaining a score of 3 or higher)
Threshold for Score Level 2 (Ability level at which examinee has 50 percent chance of obtaining a score of 2 or higher)
Literacy (AIC = 33058)
Confused?
Take aways
• Validation matters: it depends on particular uses and consequences of the score data
• There is evidence for a unidimensional internal structure of the PACT scale on production data to support licensure decisions (who is “in” and who is “out” of teaching in CA public schools)
• There is evidence for person separation (r = .918) across the unidimensional scale, but rater effects (halo, drift) are not yet well understood
• There is evidence for multidimensionality not envisioned by the developers, which undermines strong claims by domain
• Sub-score reporting and decomposition (e.g., Academic Language, Assessment, Reflection) are not warranted given the MD findings
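For readers unfamiliar with the person separation statistic cited above, one common definition is the share of observed variance in ability estimates that is not measurement error. The slides do not state which estimator produced r = .918, so the sketch below is only an illustration of that definition, using made-up ability estimates and standard errors.

```python
from statistics import fmean, pvariance

def person_separation_reliability(ability_estimates, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    observed_variance = pvariance(ability_estimates)
    error_variance = fmean([se ** 2 for se in standard_errors])
    return (observed_variance - error_variance) / observed_variance

# Hypothetical estimates (logits) and standard errors, for illustration only.
estimates = [-2.0, -1.0, 0.0, 1.0, 2.0]
errors = [0.4, 0.4, 0.4, 0.4, 0.4]

reliability = person_separation_reliability(estimates, errors)
print(f"{reliability:.2f}")  # 0.92 for these toy values
```

A value near 0.9 indicates that candidates are well spread out relative to the precision of their estimates, though it says nothing about rater effects, which would need to be modeled separately.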
Eternal Return: Validation to support responsible use
• Content Validity: Do the developers demonstrate coverage of the curriculum with a sufficient number of tasks for each topic area to ensure the meaningfulness of the score results?
• Response Processes Validity: Did the developers interview candidates to check how time, energy, motivation, confusion, language facility, writing ability, test-wiseness, etc. may have reduced/overstated the meaningfulness of the score results?
• Internal Structure Validity: Does a statistical analysis of the “dimensionality” of the teacher candidates’ results indicate that we are measuring what we intend to measure?
• Relations to Other External Variables Validity: Do the PACT score results correlate with other scores, e.g., field placement scores from university supervisors and cooperating teachers?
• Consequential Validity: Does the PACT event lead to “teaching to the test”, eliminate other “non-tested” content from the teacher education curriculum, or otherwise harm the novice teacher learning experience?
Next steps: No short cuts
1. Strong Theory Statements/Claims derived from latent regression analyses to investigate the effect of campus on person ability estimates are NOT warranted:
  • E.g., “CSU3”, “CSU4”, and “UC1” all have significantly higher mean ability estimates
  • E.g., “CSU1” has significantly lower mean ability estimates
2. Problem of interpreting score results because of confounding of individual candidates, campus, program “treatments” and raters
  • I.e., a multi-level, multi-dimensional study must be conducted with a richer data set to control for unobserved variation
3. Current construct definitions and instrumentation may not detect individual differences in consistent or meaningful ways at a “grain size” fine enough to offer formative feedback to candidates on Assessment or AL tasks
Thank you
For more info. contact: [email protected]