brent duckor ph.d. (sjsu) april 22, 2014bearcenter.berkeley.edu/sites/default/files/duckor &...

Brent Duckor Ph.D. (SJSU) Kip Tellez, Ph.D. (UCSC)

BEAR Seminar April 22, 2014

Studies under review

ELA event

Duckor, B., Castellano, K., Téllez, K., & Wilson, M. (2013, April). Validating the internal structure of the Performance Assessment for California Teachers (PACT): A multi-dimensional item response model study. Paper presented at the annual meeting of the American Educational Research Association conference, San Francisco, California.

Mathematics event

Castellano, K., Duckor, B., Téllez, K., Wihardini, D., & Wilson, M. (2013, April). Validity evidence for the internal structure of the Performance Assessment for California Teachers (PACT): Examining the elementary mathematics teaching event. Paper presented at the annual meeting of the National Council on Measurement in Education conference, San Francisco, California.

Becoming a CA teacher

➢ CA license requirements (“Pre-service”)

•  Subject competency tests (CSET)

•  Coursework

•  TPA (PACT)

➢ CA license requirements (“Preliminary” In-service)

•  BTSA Induction program

➢ “Clear” Credential & Full License

For teacher candidates

✓ The PACT is a standardized licensure “exam” constructed in the field over several weeks

✓ It comprises multiple constructed-response tasks

✓ It is “high stakes” in the sense that it has consequences

Instrument Content Primer

The Teaching Event is an extended constructed-response item design

It includes written responses to task prompts

It also includes video clips (e.g., two 10-minute clips)

And artifacts such as lesson plans, instructional materials, examples of student work, assessments, etc.

Validity

Validity: Working Definition

“The degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests.”

(AERA, APA, & NCME, 1999)

Translated into plain English

“The degree to which the written and video evidence provided by the teacher candidate supports the interpretation of that individual’s scores (on 5 domains and 12 tasks) to determine if that individual will enter the California public school classroom.”

(AERA, APA, & NCME, 1999)

Content Validity is not enough

“Teacher educators who participated in the development and design of the assessments were asked to judge the extent to which the content of the Teaching Events was an authentic representation of important dimensions of teaching. Another study examined the alignment of the TE tasks to the California Teaching Performance Expectations (TPEs). Overall, the findings across all content validity activities suggest a strong linkage between the TPE standards, the TE tasks and the skills and abilities that are needed for safe and competent professional practice.” (Technical Manual, 2007, pp. 25-27)[Emphasis added]

Validity Evidence

Content

Response processes

Internal structure

Relations to external variables

Consequences

As Kane notes (1994)

•  The plausibility of an interpretation depends on evidence supporting the proposed interpretation and refuting competing interpretations.

•  Moreover, we should expect that different types of validity evidence will be relevant to different parts of the argument.

•  Claims that the situations included in licensure examinations are representative of the situations encountered in some area of practice could be supported by expert judgment or by empirical data.

Validation to support responsible use

❑ Content Validity: Do the developers demonstrate coverage of the curriculum with a sufficient number of tasks for each topic area to ensure the meaningfulness of the score results?

❑ Response Processes Validity: Did the developers interview candidates to check how time, energy, motivation, confusion, language facility, writing ability, test-wiseness, etc. may have reduced/overstated the meaningfulness of the score results?

❑ Internal Structure Validity: Does a statistical analysis of the “dimensionality” of the teacher candidates’ results indicate that we are measuring what we intend to measure?

❑ Relations to Other External Variables Validity: Do the PACT score results correlate with other scores, e.g., field placement scores from university supervisors and cooperating teachers?

❑ Consequential Validity: Does the PACT event lead to “teaching to the test”, eliminate other “non-tested” content from the teacher education curriculum, or otherwise harm the novice teacher learning experience?

Our study: Focus on Internal Structure Validity Claims

•  Objective: Examine evidence for internal structure of the Elementary Literacy (EL) PACT scores

•  Theoretical framework: Validation study to address two research questions:

•  To what extent does an IRT model fit the EL PACT instrument and aid in describing teacher candidate “ability” and task “difficulty” across the State?

•  Is there evidence that the EL PACT assesses multiple constructs other than intended?

•  Methods & Data Analyses

•  Sampling: 2008-2010 (n = 1,711)

•  Scoring & Data (production, masked, no student ID)

•  Statistical procedures: Partial credit model (IRT) for unidimensional and multidimensional investigation

•  Results

Summary Stats

Table 1. Summary Statistics by Item

Items by domain: Planning (P1-P3), Instruction (I4-I5), Assessment (A6-A8), Reflection (R9-R10), Academic Language (AL11-AL12)

Time       Statistic    P1    P2    P3    I4    I5    A6    A7    A8    R9   R10  AL11  AL12
2008-2009  N           407   407   406   407   407   407   407   311   407   407   407   407
           Mean       2.74  2.74  2.59  2.54  2.45  2.63  2.44  2.54  2.55  2.54  2.18  2.52
           SD         0.72  0.76  0.71  0.74  0.74  0.78  0.76  0.81  0.71  0.78  0.71  0.67
2009-2010  N          1304  1303  1304  1304  1304  1304  1304  1297  1303  1304  1304  1303
           Mean       2.83  2.82  2.73  2.56  2.43  2.61  2.47  2.56  2.50  2.51  2.27  2.46
           SD         0.65  0.70  0.68  0.71  0.71  0.75  0.75  0.78  0.71  0.74  0.67  0.62
Overall    N          1711  1710  1710  1711  1711  1711  1711  1608  1710  1711  1711  1710
           Mean       2.81  2.80  2.70  2.56  2.43  2.61  2.46  2.56  2.51  2.52  2.25  2.48
           SD         0.66  0.71  0.69  0.71  0.72  0.76  0.75  0.78  0.71  0.75  0.68  0.63

Constructs: Readiness to teach in 5 Domains

Items Design: 12 Tasks

Outcome Space: 12 Rubrics

Reliability

Validity

Measurement Model: IRT scale scores

Findings

• Unidimensional Partial Credit Model

• Partial Credit Model fits better than the Rating Scale Model

• The step difficulty parameters band together

• The Planning items tend to be the easiest and the Academic Language the most difficult
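To make "step difficulty parameters" more concrete, the following is a minimal sketch of how the partial credit model turns a person location and a set of step difficulties into score-category probabilities. The function name `pcm_probs` and the step values are hypothetical illustrations, not estimates from the PACT data.

```python
import math

def pcm_probs(theta, deltas):
    """Category probabilities under the partial credit model.

    theta  : person location in logits
    deltas : step difficulties delta_1..delta_m for one item, in logits
    Returns a list of probabilities for score categories 0..m.
    """
    # Cumulative sums of (theta - delta_k); the lowest category is 0 by convention.
    cum = [0.0]
    for d in deltas:
        cum.append(cum[-1] + (theta - d))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cum)
    expd = [math.exp(c - top) for c in cum]
    total = sum(expd)
    return [e / total for e in expd]

# Hypothetical step difficulties for a four-level rubric (three steps).
steps = [-5.0, 0.0, 4.0]
probs = pcm_probs(theta=0.0, deltas=steps)
```

The rating scale model is the restricted case in which every item shares the same spacing between steps, which is why a better-fitting partial credit model suggests that step structure differs across items.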

IRT and Rasch Models

•  In the Rasch model, the probability of a specified response is modeled as a function of person and item parameters

•  Person (theta) parameters = teacher candidates’ “proficiency”

•  Item (delta) parameters = “task difficulties”
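As a small worked illustration of the relationship between theta and delta, the sketch below evaluates the dichotomous Rasch response probability for a person located above, at, and below an item. The function name `rasch_prob` and the numeric locations are assumptions chosen for illustration only.

```python
import math

def rasch_prob(theta, delta):
    """Rasch model: P(response) = exp(theta - delta) / (1 + exp(theta - delta))."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

delta = 0.5                      # hypothetical item location (logits)
above = rasch_prob(1.5, delta)   # person above the item: more than 50% likely
equal = rasch_prob(0.5, delta)   # person level with the item: exactly 50%
below = rasch_prob(-0.5, delta)  # person below the item: less than 50% likely
```

Only the difference theta - delta matters, so persons and items can be placed on the same logit scale and compared directly.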

Figure 12. Representation of possible relationships between respondent location (θ) and item location (δi), shown in panels (a), (b), and (c) (adapted from Wilson, 2005).

As shown in case (a) of Figure 12, if a respondent’s location on the person (right-hand) side of the Wright map is above the item location on the left-hand side, he or she is more than 50% likely to make that response. In this case, the probability governing a correct response suggests that the items below him or her are relatively “easier” because he or she has more of the construct, that is, proficiency for the given dimension.

As shown in case (c) of Figure 12, if a respondent’s location on the right-hand side of the Wright map is below the item location on the left-hand side, then he or she is less than 50% likely to make that response. In this case, the probability governing a correct response suggests that the items above him or her are relatively “harder” because he or she has less of the construct, that is, proficiency for the given dimension.

As shown in case (b) of Figure 12, if a respondent’s location on the right-hand side of the Wright map is co-equal with an item location on the left-hand side, then he or she is 50% likely to make that response.

Findings

• Multidimensional Partial Credit Model

• 4D (task based) & 5D (domain based) models fit less well (AIC) than 3D model
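The model comparison above rests on the Akaike information criterion, where a lower value indicates a better trade-off between fit and complexity. The sketch below shows the arithmetic only; the deviance values and parameter counts are hypothetical placeholders, NOT the estimates from this study.

```python
def aic(deviance, n_params):
    """AIC = deviance + 2 * k, where deviance = -2 * log-likelihood."""
    return deviance + 2 * n_params

# Hypothetical (deviance, number of estimated parameters) per model.
models = {
    "3D": (1000.0, 42),
    "4D": (1010.0, 46),
    "5D": (1015.0, 51),
}

scores = {name: aic(dev, k) for name, (dev, k) in models.items()}
best = min(scores, key=scores.get)  # model with the lowest AIC
```

Because each extra dimension adds variance and covariance parameters, a higher-dimensional model has to reduce the deviance by more than its added penalty to be preferred.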

Task-based model

[Figure: candidate groupings of the 12 items (P1-P3, I4-I5, A6-A8, R9-R10, AL11-AL12), panels (a)-(c). Labels recovered from the panels: task-based (Tasks 1 & 2, Task 3, Task 4, Task 5); domain-based (Planning, Instruction, Assessment, Reflection, Academic Language); three-dimensional (Planning; Instruction; Assessment, Reflection & Academic Language).]

Domain based model

[Figure: diagram of item groupings over items P1-AL12, panels (a)-(c); only the panel label “Instruction” is recoverable.]

Findings

• Additionally, pairwise correlations are not as high as for modified 3D model

Correlations

Table 2. Observed Correlations between Mean Domain Scores (above diagonal) and Disattenuated Correlations between Domains/Dimensions (below diagonal)

                   Planning  Instruction  Assessment  Reflection  Academic Language
Planning              -         0.64        0.64        0.63          0.65
Instruction          0.81        -          0.61        0.62          0.61
Assessment           0.80       0.79         -          0.70          0.67
Reflection           0.82       0.84        0.92         -            0.67
Academic Language    0.84       0.84        0.90        0.95           -
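For readers unfamiliar with disattenuation, the sketch below applies the classical Spearman correction, which divides an observed correlation by the square root of the product of the two reliabilities. This is an assumption for illustration: the slides do not state how the below-diagonal values were obtained, and they may instead be latent correlations estimated directly by the multidimensional model. The reliabilities used here are hypothetical.

```python
import math

def disattenuate(r_observed, rel_x, rel_y):
    """Classical correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    return r_observed / math.sqrt(rel_x * rel_y)

r_obs = 0.64                 # observed Planning-Instruction correlation (Table 2)
rel_planning = 0.80          # hypothetical reliability, not reported in the slides
rel_instruction = 0.80       # hypothetical reliability, not reported in the slides

r_corrected = disattenuate(r_obs, rel_planning, rel_instruction)
```

The correction always raises the correlation when reliabilities are below 1, which is why the below-diagonal values in Table 2 exceed their above-diagonal counterparts.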

Findings

• There is evidence for “Planning” and “Instructing” and “Meta-Reflecting” proficiencies fitting a 3D model

Dimension based model

[Figure: diagram of item groupings over items P1-AL12, panels (a)-(c); only the panel label “Instruction” is recoverable.]

[Figure: Wright map for the three-dimensional model, Elementary Literacy (AIC = 33058). Panels: Dimension 1 - Planning; Dimension 2 - Instruction; Dimension 3 - Assessment, Reflection, Academic Language. Each panel plots person ability (X's) against item score thresholds (e.g., P1.2, P1.3, P1.4) on a shared logit scale running from about -7 to 9.]

Confused?

Take aways

•  Validation matters: it depends on particular uses and consequences of the score data

•  There is evidence for a unidimensional internal structure of the PACT scale on production data to support the license decision (who is “in” and who is “out” of teaching in CA public schools)

•  There is evidence for person separation (r = .918) across the unidimensional scale, but rater effects (halo, drift) are not yet well understood

•  There is evidence for multidimensionality not envisioned by the developers, which undermines strong claims by domain

•  Sub-score reporting and decomposition (e.g., Academic Language, Assessment, Reflection) are not warranted given the multidimensional (MD) findings

Eternal Return: Validation to support responsible use

❑ Content Validity: Do the developers demonstrate coverage of the curriculum with a sufficient number of tasks for each topic area to ensure the meaningfulness of the score results?

❑ Response Processes Validity: Did the developers interview candidates to check how time, energy, motivation, confusion, language facility, writing ability, test-wiseness, etc. may have reduced/overstated the meaningfulness of the score results?

❑ Internal Structure Validity: Does a statistical analysis of the “dimensionality” of the teacher candidates’ results indicate that we are measuring what we intend to measure?

❑ Relations to Other External Variables Validity: Do the PACT score results correlate with other scores, e.g., field placement scores from university supervisors and cooperating teachers?

❑ Consequential Validity: Does the PACT event lead to “teaching to the test”, eliminate other “non-tested” content from the teacher education curriculum, or otherwise harm the novice teacher learning experience?

Next steps: No short cuts

1.  Strong theory statements/claims derived from latent regression analyses to investigate the effect of campus on person ability estimates are NOT warranted:

❑  E.g., “CSU3”, “CSU4”, and “UC1” all have significantly higher mean ability estimates

❑  E.g., “CSU1” has significantly lower mean ability estimates

2.  Problem of interpreting score results because of confounding of individual candidates, campus, program “treatments” and raters

❑  I.e., a multi-level, multi-dimensional study must be conducted with a richer data set to control for unobserved variation

3.  Current construct definitions and instrumentation may not detect individual differences in consistent or meaningful ways at “grain size” to offer formative feedback to candidates on Assessment or AL tasks

Thank you

For more info. contact: [email protected]
