differential item functioning on the desired results developmental profile assessment ... ·...

Differential item functioning on the Desired Results Developmental Profile Assessment for preschool students with disabilities: what can (and should) we do about it?

Joshua Sussman

Postdoctoral Scholar

Berkeley Evaluation and Assessment Research (BEAR) Center

University of California, Berkeley

Outline

• Measurement invariance for students with disabilities
• The case of the DRDP (an observational, formative measure of early childhood development)
• DIF analysis and interpretation
• What should we do about DIF?

Fairness in assessment

• Measurement invariance is cast as an issue of fairness.
• Comparable measurement across groups for fair and unbiased decision making (e.g., identical cut scores)

• “Disabilities can make it difficult for students to engage in the intended test response processes, leading to test scores that do not reflect their underlying skills (Bolt & Ysseldyke, 2008).”

• Special education eligibility for preschoolers (2004-2008, from NCES):
  • ~50% speech and language
  • ~25% unspecified developmental disability
  • ~6-7% autism diagnosis

Desired Results Developmental Profile (DRDP)

• Multidimensional assessment of early childhood development
  • Attention to Learning and Self Regulation (ATL-REG)
  • Social and Emotional Development (SED)
  • Cognitive Development: Math and Science (COG)
  • Language and Literacy Development (LLD)
  • Physical Development and Health (PD-HLTH)

• Observational measure
• A strengths-based formative assessment: not for sorting but for increasing opportunity to learn.

Differential Item Functioning (DIF)

• DIF methods have been used to study measurement invariance for students with disabilities (Scarpati, Wells, Lewis, & Jirka, 2011).

• Different DIF methods include CTT methods (contingency table) and IRT-based approaches (Millsap & Everson, 1993)

• Common characteristics of some DIF studies (Ferne & Rupp, 2007):
  • Single method (sensitive to uniform DIF)
  • Aim for homogeneous grouping variables
  • Matching on estimated latent variable (IRT studies)
  • Report on model and item fit (unidimensionality, conditional independence, etc.)
  • Interpret statistical and practical significance of DIF

This study: Methods

• Unidimensional IRT-based approach for DIF detection
  • Masters (1982) partial credit model (PCM)
  • Dimensionality examined previously
• Package ‘TAM’ in R (Robitzsch, Kiefer, & Wu, 2018)
  • item - sped + item*sped + item*step*sped
• N = 135,946 children enrolled in California preschools
• Facet model
  • Focal group: eligible for special education (n = 9,258)
  • Reference: General education (n = 126,688)
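
As a rough illustration of what the PCM with a group DIF term implies, the Python sketch below computes partial credit model category probabilities for a single item and shows how an item-by-group shift changes the expected rating at a fixed proficiency. The function names, the step difficulties, and the 0.3-logit shift are illustrative assumptions, not DRDP estimates; the analysis itself was run with TAM in R.

```python
import math

def pcm_probs(theta, step_difficulties, dif_shift=0.0):
    """Category probabilities under Masters' (1982) partial credit model.

    theta: person proficiency in logits.
    step_difficulties: step parameters delta_1..delta_m for one item.
    dif_shift: hypothetical group-specific shift added to every step
        (uniform DIF); 0.0 means no DIF.
    """
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - (delta + dif_shift)))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cumulative)
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(probs):
    """Expected rating category given the category probabilities."""
    return sum(k * p for k, p in enumerate(probs))

steps = [-1.0, 0.0, 1.2]                      # made-up step difficulties
reference = pcm_probs(0.5, steps)             # no DIF term
focal = pcm_probs(0.5, steps, dif_shift=0.3)  # item 0.3 logits harder for this group
print(expected_score(reference), expected_score(focal))
```

Two children with the same proficiency get different expected ratings on the shifted item, which is the sense in which DIF threatens comparable measurement.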

Exploring the Rasch PCM (No DIF)

WLE reliability = 0.945

[Figures: Wright map; proficiency distribution (WLE); conditional SEM]

Results: DIF Model

• Model fit: LR test for PCM vs. DIF PCM

             ATL-REG   SED       COG       LLD       PD-HLTH
Chi-square   1215      2172      2324      4895      2220
df           28        35        64        69        73
p            < 0.001   < 0.001   < 0.001   < 0.001   < 0.001

• Similar story for AIC, BIC, CAIC, etc…
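
To put the likelihood-ratio statistics above in context, the Python sketch below compares each reported chi-square with an approximate critical value at alpha = .001. It uses the Wilson-Hilferty approximation to the chi-square quantile so that only the standard library is needed; the helper name `chi2_critical` is hypothetical, and an exact quantile function from a statistics package would be preferable in real work.

```python
from statistics import NormalDist

def chi2_critical(df, alpha=0.001):
    """Approximate upper-tail chi-square critical value (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

# Likelihood-ratio statistics and degrees of freedom from the table above.
lr_tests = {
    "ATL-REG": (1215, 28),
    "SED": (2172, 35),
    "COG": (2324, 64),
    "LLD": (4895, 69),
    "PD-HLTH": (2220, 73),
}

for domain, (chi2, df) in lr_tests.items():
    crit = chi2_critical(df)
    print(f"{domain}: chi2 = {chi2}, df = {df}, "
          f"critical value ~ {crit:.1f}, exceeds = {chi2 > crit}")
```

With a sample this large, even small amounts of DIF produce a significant test, which is why practical significance is interpreted alongside the statistical result.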

Results: Item Fit (under construction)

Estimated differences in group proficiency between special and general education students (logits)

Dimension   Difference   SE      Difference*2
ATL-REG     0.285        0.002   0.571
SED         0.490        0.002   0.981
COG         0.352        0.001   0.705
LLD         0.356        0.001   0.711
PD-HLTH     -0.021       0.001   -0.042

Positive values indicate higher ability in the general education group

[Figure: Difference between SpEd and non-SpEd. Plot annotations: "Easier for those in special education"; "Harder for those in special education"]

(under construction: add effect sizes from Paek & Wilson)

Highest-DIF item (easier for students in special education)

• Established link between language disorder (50% of the SpEd sample) and difficulties with attention and executive functioning (Mueller & Tomblin, 2012)

Item infit = 0.93

Highest positive-DIF item (harder for SpEd students)

• Fine motor delays are common sequelae among students with a variety of disabilities, including difficulties with speech and language (Brookman, Macdonald, Macdonald, & Bishop, 2013)

Item infit = 0.97

What to do about items displaying DIF?

• Remove
  • Politically difficult
  • May impact the construct
• Revise
  • Change the item prompt or anchors (similar issues as removal)
  • Change the rater training
• Leave
• Additional psychometric development
  • Produce different tests for different groups (construct representation issues)
  • Model dimensionality, rater effects, other covariates

Thanks!

Appendix

No step DIF model

Item*sped*step

DIF

• For test administrations to be considered valid for all student groups, there must be comparable measurement across groups. This can ensure that decisions based on test results are made in a fair manner for all students.

• Examinations of score comparability across various student groups and various testing conditions are considered an important piece of evidence suggesting that a test will lead to fair and unbiased decision making (AERA, APA, & NCME, 1999; Braden & Niebling, 2006).

• Several empirical studies of accommodations provided to students with physical disabilities were conducted in the 1980s using this approach (Bennett, Rock, & Jirele, 1987; Bennett, Rock, & Kaplan, 1987; Bennett, Rock, & Novatkoski, 1989; Rogers, 1983). In general, these researchers found that accommodated test administrations for students with sensory/physical disabilities tended to show limited DIF. Recently, several research teams have been using this approach to examine the validity of accommodations for students with mental disabilities, some using factor analysis to examine measurement comparability (Huynh & Barton, 2006; Pomplun & Omar, 2000), others using analysis of DIF (Bolt & Ysseldyke, 2006; Lewis, Green, & Miller, 1999), with results varying in terms of the extent to which measurement comparability has been identified.

• Examining SpEd DIF
• Why DIF matters for SpEd
• DRDP Ax early Cx Dev
• SpEd in preschoolers in particular
• 50% SLI, 25% “developmental disability,” 6% Autism (NCER)
• DIF methods in the literature.
• DIF Ax
• Results = DIF
• Model fit
• Item fit = OK
• DIF plot
• Interpreting the DIF
• Pros and cons
• What to do about the DIF
• This is where it gets real: these Ax are in use right now.

DRDP and students with disabilities

• What is DRDP
• Students with disabilities..
• DRDP supports students.. Students with disabilities too

Students with disabilities

• “Disabilities can make it difficult for students to engage in the intended test response processes, leading to test scores that do not reflect their underlying skills.” X & Ysseldyke,
• Preschool eligibility is typically for severe issues.

Methods

• Sample
• N=
• Sped n

Equation

• Equation 1 represents the baseline comparison model and includes only the category threshold and a slope term for the theta estimate. DIF was evaluated by comparing the fit (−2 log likelihood) between Equation 1 and Equation 2 (uniform DIF) and between Equations 2 and 3 (nonuniform DIF). If the differences in model fit were statistically significant, then DIF was detected.
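
The three nested models described above are not written out in these notes. One common way to write them, assuming an ordinal logistic regression formulation, is sketched below. The symbols (category thresholds \alpha_k, slopes \beta_1 to \beta_3, and a group indicator G) are notation chosen here for illustration and are not taken from the source.

```latex
\begin{align}
\operatorname{logit} P(Y \ge k) &= \alpha_k + \beta_1 \theta \tag{1} \\
\operatorname{logit} P(Y \ge k) &= \alpha_k + \beta_1 \theta + \beta_2 G \tag{2} \\
\operatorname{logit} P(Y \ge k) &= \alpha_k + \beta_1 \theta + \beta_2 G + \beta_3 \theta G \tag{3}
\end{align}
```

Under this reading, a significant improvement in fit from (1) to (2) indicates uniform DIF (a group effect at every level of theta), and from (2) to (3) indicates nonuniform DIF (a group effect that varies with theta).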

Evaluation

• Model fit
• Effect sizes were also considered when determining the meaningfulness of DIF.

Results

• n/% of items (steps?) with small DIF
• n/% with moderate, large
• n/% of items with fit.

Purify

• The impact of DIF was evaluated by comparing risk estimates (i.e., person theta estimates) from a model including all items and a model adjusted for DIF. The method of adjusting for DIF is sometimes referred to as purifying or resolving items that exhibit DIF (Zumbo, 1999). To purify an item that exhibits DIF, the model was adjusted to include group-specific IRT parameters for those items (i.e., separate item characteristic curves were estimated for each group). The resulting purified IRT theta estimates were anchored to the non-DIF items providing an estimate of risk that is free of DIF. Theta estimates for the two models were equated

• To test the individual-level impact of DIF, each student’s naive theta estimate (from the model based on all items) was compared with their purified theta estimate. Differences in theta estimates greater than the median standard error (of the naive theta estimates) were considered to represent meaningful individual-level impact of DIF (Choi et al., 2011). Differences in theta estimates were also compared with each individual student’s naive standard error (i.e., uncertainty of initial score)…

• ALSO SHOW CATEGORIES
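
The individual-level impact check described in the notes above can be sketched in a few lines of Python. This is a minimal illustration, assuming the naive and purified theta estimates are already on a common scale; the function name `flag_dif_impact` and all numbers are made up for the example and are not DRDP results.

```python
from statistics import median

def flag_dif_impact(naive_theta, purified_theta, naive_se):
    """Flag students whose theta changes by more than the median naive SE."""
    threshold = median(naive_se)
    flags = [abs(n - p) > threshold
             for n, p in zip(naive_theta, purified_theta)]
    return threshold, flags

# Made-up values for five students (logits); not DRDP data.
naive = [0.10, -0.50, 1.20, 0.80, -1.10]
purified = [0.15, -0.95, 1.18, 0.40, -1.12]
se = [0.30, 0.35, 0.28, 0.32, 0.40]

threshold, flags = flag_dif_impact(naive, purified, se)
print(threshold, flags)
```

The same comparison can be repeated with each student's own standard error in place of the median, which is the second check the notes mention.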

Good development

• Working with EESD

• DIF
• What are the items?
• What is the profile of students who are identified with disabilities as preschoolers? Set this up in the beginning or the end?
• Fine motor
• Activity level
• Two stories: most items show no DIF.
• The items that do show DIF are reflecting known issues.
• Examining the comorbidity of language disorders and ADHD (50% of those id’d are SLI, large comorbidity), 25% are reported as having a developmental delay, fine motor and activity level being more

• Pre-Elementary Education Longitudinal Study (PEELS)

https://nces.ed.gov/datalab/
