Evaluating Impacts of MSP Grants: Common Issues and Recommendations
Hilary Rhodes, PhD, and Ellen Bobronnikov
February 22, 2010


Page 1

Evaluating Impacts of MSP Grants

Hilary Rhodes, PhD, and Ellen Bobronnikov

February 22, 2010

Common Issues and Recommendations

Page 2

Overview

• Purpose of using Criteria for Classifying Designs of MSP Evaluations (“the rubric”) created by Westat through the Data Quality Initiative (DQI)

• Rubric’s key criteria for a rigorous design

• Common issues with evaluations

• Recommendations for more rigorous evaluations

Page 3

Apply rubric to ensure reliable results.

• Projects meeting the rubric criteria provide a more accurate determination of impact on teacher and student outcomes

• A two-step “screening” process of final-year MSP evaluations identifies those with rigorous designs

Page 4

First, assess evaluation design.

• To “qualify,” final-year evaluations must use an experimental or quasi-experimental design with a comparison group

• Of the 183 projects in their final year during PP07, 37 had qualifying evaluations with complete data

• Programs that did not qualify often used one-group pre-post studies, which cannot account for changes that would have occurred even in the absence of the program

Page 5

Next, use rubric to assess implementation.

• The second step is to apply the rubric to see whether the design was implemented with sufficient rigor

• The rubric comprises six criteria:

1. Equivalence of groups at baseline

2. Adequate sample size

3. Use of valid & reliable measurement instruments

4. Use of consistent data collection methods

5. Sufficient response and retention rates

6. Reporting of relevant statistics

Page 6

Criterion 1 – Baseline Equivalence

• Study demonstrates no significant differences between treatment and comparison groups at baseline (needed for quasi-experimental studies only)

• Purpose of Criterion – helps rule out alternative explanations for differences between groups

Page 7

Criterion 1 – Baseline Equivalence

Common Issues:

• Baseline characteristics reported without a statistical test for differences

• Information critical for complete assessment of baseline equivalence (e.g., sample size, standard deviation) is missing

Page 8

Criterion 1 – Baseline Equivalence

Recommendations:

• Report key characteristics associated with outcomes for each group (e.g., pretest scores, teaching experience). ALWAYS report sample sizes.

• Test for group mean differences on key characteristics with appropriate statistical tests (e.g., chi-square for dichotomous variables, t-test for continuous variables) and report the test statistics (e.g., t-stat, p-value); see the sketch after this list.

• Control for significant differences between groups in the statistical analyses if differences exist at baseline.
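
For illustration, the tests above can be run in a few lines of Python with scipy. This is a minimal sketch with hypothetical baseline data and group sizes, not MSP results:

import numpy as np
from scipy import stats

# Hypothetical baseline pretest scores for treatment and comparison teachers
treat_pretest = np.array([62.0, 55.5, 71.2, 68.0, 59.4, 63.1])
comp_pretest = np.array([60.1, 58.3, 69.9, 64.7, 57.2, 61.0])

# t-test for a continuous baseline characteristic (e.g., pretest scores)
t_stat, p_val = stats.ttest_ind(treat_pretest, comp_pretest)
print(f"pretest: n={len(treat_pretest)}/{len(comp_pretest)}, t={t_stat:.2f}, p={p_val:.3f}")

# Chi-square test for a dichotomous characteristic (e.g., certified yes/no);
# rows are groups, columns are category counts
table = np.array([[18, 12],   # treatment: certified / not certified
                  [15, 15]])  # comparison: certified / not certified
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"certification: chi2={chi2:.2f}, p={p:.3f}")

Reporting the sample sizes alongside each test statistic, as in the print statements, also satisfies the “ALWAYS report sample sizes” recommendation.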

Page 9

Criterion 2 – Sample Size

• Sample size is adequate, based on a power analysis (sketched after this list) using:

– Significance level = 0.05

– Power = 0.8

– Minimum detectable effect informed by the literature or otherwise justified

• Alternatively, meet or exceed “rule of thumb” threshold sample sizes:

– Teacher outcomes: 12 schools or 60 teachers

– Student outcomes: 12 schools or 18 teachers or 130 students

• Purpose of Criterion – builds confidence in the results
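
A minimal power-analysis sketch using statsmodels with the parameters listed above; the effect size of 0.40 is an assumed minimum detectable effect used only for illustration, and a real evaluation would justify its own value from the literature:

from statsmodels.stats.power import TTestIndPower

# alpha = 0.05 and power = 0.8 come from the rubric; effect_size = 0.40 is
# an illustrative minimum detectable effect, not a recommended value
n_per_group = TTestIndPower().solve_power(effect_size=0.40, alpha=0.05, power=0.8)
print(f"Required sample size per group: {n_per_group:.0f}")  # about 99 per group

A smaller minimum detectable effect raises the required sample size quickly, which is why the assumed effect needs a defensible justification.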

Page 10

Criterion 2 – Sample Size

Common Issues:

• Sample and subgroup sizes not reported for all teacher and student outcomes

• Sample sizes reported inconsistently across project documents

Page 11

Criterion 2 – Sample Size

Recommendations:

• Always provide clear reporting of sample sizes for all groups and subgroups.

• Conduct a power analysis during the design stage and report the results in the evaluation.

• If you do not conduct a power analysis, ensure that you meet the threshold values.

Page 12

Criterion 3 – Measurement Instruments

• Use existing instruments that have already been deemed valid and reliable to measure key outcomes,

OR

• Create new instruments that have either been:

– Sufficiently tested with subjects comparable to the study sample and found to be valid and reliable, OR

– Created using scales and items from pre-existing data collection instruments that have been validated and found to be reliable

• Purpose of Criterion – ensures that instruments used accurately capture the intended outcomes

Page 13

Criterion 3 – Measurement Instruments

Common Issues:

• Validity and reliability testing not reported for locally developed instruments

• Results of validity or reliability testing on pre-existing instruments not reported

Page 14

Criterion 3 – Measurement Instruments

Recommendations:

• Select instruments that have been shown to produce accurate and consistent scores in a population similar to yours.

• If creating an assessment for the project, test the new instrument’s validity and reliability with a group similar to your subjects and report the results (a reliability-check sketch follows this list).

• When selecting items from an existing measure:

– Describe previous work demonstrating that the source produces valid, reliable scores;

– Provide references that describe the instrument’s reliability & validity; and,

– Use full sub-scales where possible.
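
One widely used reliability check for a newly created instrument is Cronbach’s alpha; the sketch below computes it with numpy on hypothetical pilot responses:

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = scale items."""
    k = items.shape[1]                         # number of items in the scale
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical pilot data: 5 respondents answering a 4-item scale
responses = np.array([[4, 5, 4, 4],
                      [2, 3, 2, 3],
                      [5, 5, 4, 5],
                      [3, 3, 3, 2],
                      [4, 4, 5, 4]])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")

A conventional rule of thumb treats alpha of roughly 0.7 or higher as acceptable internal consistency; validity still needs separate evidence, since a reliable instrument can consistently measure the wrong thing.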

Page 15

Criterion 4 – Data Collection Methods

• Methods, procedures, and timeframes used to collect the key outcome data from treatment and comparison groups are comparable

• Purpose of Criterion – limits possibility that observed differences can be attributed to factors besides the program, such as passage of time and differences in testing conditions

Page 16

Criterion 4 – Data Collection Methods

Common Issues:

• Data collected from the groups at different times or not systematically

• Little information provided about data collection, or the process described only for the treatment group

Page 17

Criterion 4 – Data Collection Methods

Recommendations:

• Document and describe the data collection procedures.

• Make every effort to collect data from the treatment and comparison groups for every outcome evaluated. If data cannot be collected from all members of both groups, consider randomly selecting a subset from each group, as sketched below.
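
A minimal sketch of that random-subset selection, with hypothetical participant IDs; fixing the random seed makes the selection reproducible and documentable:

import random

random.seed(42)  # fixed seed so the selection can be documented and reproduced
treatment_ids = [f"T{i:03d}" for i in range(1, 81)]   # 80 hypothetical treatment teachers
comparison_ids = [f"C{i:03d}" for i in range(1, 81)]  # 80 hypothetical comparison teachers

subset_size = 30  # collect data from the same number of members in each group
treatment_sample = random.sample(treatment_ids, subset_size)
comparison_sample = random.sample(comparison_ids, subset_size)
print(treatment_sample[:5], comparison_sample[:5])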

Page 18

Criterion 5 – Attrition

• Need to measure key outcomes for at least 70% of the original sample (both treatment and control groups)

• If attrition rates between the groups differ by 15 percentage points or more, the difference should be accounted for in the statistical analysis

• Purpose of Criterion – helps ensure that sample attrition does not bias results, as participants/control group members who drop out may systematically differ from those who remain

Page 19

Criterion 5 – Attrition

Common Issues:

• Sample attrition rates not reported, or reported for treatment group only

• Initial sample sizes not reported for all groups, so attrition rates could not be calculated

Page 20

Criterion 5 – Attrition

Recommendations:

• Report the number of units of assignment and analysis at the beginning and end of the study.

• If reporting on sub-groups, indicate their pre & post sample sizes. A minimal attrition sketch follows this list.
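
The sketch below, with illustrative counts, computes each group's attrition rate from the reported beginning and ending sample sizes and flags the two conditions the rubric names (retention below 70% and a differential of 15 or more percentage points):

def attrition_rate(n_initial: int, n_final: int) -> float:
    """Percent of the initial sample lost by the end of the study."""
    return 100.0 * (n_initial - n_final) / n_initial

treat = attrition_rate(n_initial=100, n_final=82)  # 18.0% attrition
comp = attrition_rate(n_initial=95, n_final=60)    # about 36.8% attrition

print(f"Treatment attrition: {treat:.1f}%  Comparison attrition: {comp:.1f}%")
if max(treat, comp) > 30:
    print("Retention fell below 70% of the original sample: criterion not met.")
if abs(treat - comp) >= 15:
    print("Differential attrition of 15+ points: account for it in the analysis.")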

Page 21

Criterion 6 – Relevant Statistics Reported

• Include treatment and comparison group post-test means and tests of significance for key outcomes, OR

• Provide sufficient information for calculation of statistical significance (e.g., mean, sample size, standard deviation/standard error)

• Purpose of Criterion – provides context for interpreting results, indicating where observed differences between groups are most likely larger than what chance alone might cause

Page 22

Criterion 6 – Relevant Statistics Reported

Common Issue:

• Incomplete information made it difficult to assess evaluations for statistical significance. Commonly missing data points included means, standard deviations/standard errors, and sample sizes

Page 23

Criterion 6 – Relevant Statistics Reported

Recommendations:

• Report means, sample sizes, and standard deviations/errors for treatment and comparison groups on all key outcomes (see the sketch after this list).

• Report results from appropriate significance testing of differences observed between groups (e.g., for continuous variables, report t-stats or p-values).

• If using a regression model or ANOVA, describe the model and report the means and standard deviations/errors.
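
Reporting means, standard deviations, and sample sizes is sufficient because significance can be computed from the summary statistics alone, as this sketch with scipy shows (the numbers are illustrative, not MSP results):

from scipy.stats import ttest_ind_from_stats

result = ttest_ind_from_stats(
    mean1=74.2, std1=8.5, nobs1=60,   # hypothetical treatment post-test summary
    mean2=70.8, std2=9.1, nobs2=55,   # hypothetical comparison post-test summary
)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")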

Page 24

Page 25

Mathematics and Science Partnership (MSP) Programs

U.S. Department of Education

San Diego Regional Meeting

February 2010