
Preliminary and Incomplete: Comments Welcome

THE IMPACT OF HIGH-STAKES TESTING ON STUDENT ACHIEVEMENT:

EVIDENCE FROM CHICAGO∗

Brian A. Jacob

John F. Kennedy School of Government

Harvard University

June 2001

∗ I would like to thank the Chicago Public Schools, the Illinois State Board of Education and the Consortium on Chicago School Research for providing the data used in this study. I am grateful to Anthony Bryk, Carolyn Hill, Robert LaLonde, Lars Lefgren, Steven Levitt, Helen Levy, Susan Mayer, Melissa Roderick, Robin Tepper and seminar participants at the University of Chicago for helpful comments and suggestions. Funding for this research was provided by the Spencer Foundation. All remaining errors are my own.


Abstract

School reforms designed to hold students and teachers accountable for student achievement have become increasingly popular in recent years. Yet there is little empirical evidence on how such policies impact student or teacher behavior, or how they ultimately affect student achievement. This study utilizes detailed administrative data on student achievement in Chicago to examine the impact of a high-stakes testing policy introduced in 1996. I find that math and reading scores on the high-stakes exam increased sharply following the introduction of the accountability policy, but that there was little if any effect on a state-administered, low-stakes exam. At the same time, science and social studies scores on both exams leveled off or declined under the new regime. There is also evidence that the policy increased retention rates in primary grades that were not subject to the promotional criteria.


1. Introduction

School reforms designed to hold students and teachers accountable for student

achievement have become increasingly popular in recent years. Statutes in 19 states explicitly

link student promotion to performance on a state or district assessment (ECS 2000). The largest

school districts in the country, including New York City, Los Angeles, Chicago and Washington,

D.C., have recently implemented policies requiring students to attend summer school and/or

repeat a grade if they do not demonstrate sufficient mastery of basic skills. At the same time, 20

states reward teachers and administrators on the basis of exemplary student performance and 32

states sanction school staff on the basis of poor student performance. Many states and districts

have passed legislation allowing the takeover or closure of schools that do not show

improvement (ECS 2000).

Despite the increasing popularity of high-stakes testing, there is little evidence on how

such policies influence student or teacher behavior, or how they ultimately affect student

achievement. Standard incentive theory suggests that high-stakes testing will increase student

achievement in the measured subjects and exams, but also may lead teachers to shift resources

away from low-stakes subjects and to neglect infra-marginal students. Such high-powered

incentives may also cause students and teachers to ignore less easily observable dimensions of

student learning such as critical thinking in order to teach the narrowly defined basic skills that

are tested on the high-stakes exam (Holmstrom 1991).

Chicago was one of the first large, urban school districts to implement a comprehensive

high-stakes accountability policy. Beginning in 1996, Chicago Public Schools in which fewer

than 15 percent of students met national norms in reading were placed on probation, which

entailed additional resources as well as oversight. If student performance did not improve in

these schools, teachers and administrators were subject to reassignment or dismissal. At the

same time, the CPS took steps to end “social promotion,” the practice of passing students to the


next grade regardless of their academic ability. Students in third, sixth and eighth grades (the

“gate” grades) were required to meet minimum standards in reading and mathematics on the

Iowa Test of Basic Skills (ITBS) in order to advance to the next grade.

This paper utilizes detailed administrative data on student achievement in Chicago to

examine the impact of the accountability policy on student achievement.1 I first compare

achievement trends before and after the introduction of high-stakes testing, conditioning on

changes in observable student characteristics such as demographics and prior achievement. I

then compare the achievement trends for students and schools who were likely to be strongly

influenced by high-stakes testing (e.g., low-achieving students and probation schools)

with trends among students and schools less affected by the policy. To the extent that these

high-achieving students and schools were not influenced by the policy, this difference-in-

difference strategy will eliminate any common, unobserved time-varying factors that may have

influenced overall achievement levels in Chicago. In order to assess the generalizability of

achievement changes in Chicago, I compare achievement trends on the Iowa Test of Basic Skills

(ITBS), the exam used for promotion and probation decisions, with comparable trends on a state-

mandated, low-stakes exam, the Illinois Goals Assessment Program (IGAP). Using a panel of

Illinois school data during the 1990s, I am able to compare IGAP achievement trends in Chicago

with other urban districts in Illinois, thereby controlling for unobserved statewide factors, such

as improvements in the economy or changes in state or federal educational policies. Finally, I

examine several unintended consequences of the policy, such as changes in retention rates and

achievement levels in grades and subjects not directly influenced by the policy.

I find that ITBS math and reading scores increased sharply following the introduction of

high-stakes testing, but that the pre-existing trend of IGAP scores did not change. There is some

1 In this analysis, I do not focus on the effect of the specific treatments that accompanied the accountability policy such as the resources provided to probation schools, summer school and grade retention. For an evaluation of the probation, summer school and grade retention programs, see Jacob and Lefgren (2001).


evidence that test preparation and short-term effort may account for a portion of the differential

ITBS-IGAP gains. While high-stakes testing did not have any impact on the likelihood that

students would transfer to a private school, move out of the CPS to a different public school

district (in the suburbs or out of the state) or drop out of school, it appears to have increased

retention rates in the primary grades and decreased relative achievement in science and social

studies.

The remainder of this paper is organized as follows. Section 2 lays out a conceptual

framework for analyzing high-stakes testing, providing background on the Chicago reforms and

reviewing the previous literature on accountability and achievement. Section 3 discusses the

empirical strategy and Section 4 describes the data. Sections 5 to 9 present the main findings.

Section 10 summarizes the major findings and discusses several policy implications.


2. A Conceptual Framework for Analyzing High-Stakes Testing

2.1 Background on High-Stakes Testing in Chicago

In 1996 the CPS introduced a comprehensive accountability policy in an effort to raise

academic achievement. The first component of the policy focused on holding students

accountable for learning, ending a common practice known as “social promotion” whereby

students are advanced to the next grade regardless of their ability or achievement. Under this

policy, students in third, sixth and eighth grades are required to meet minimum standards in

reading and mathematics on the Iowa Test of Basic Skills (ITBS) in order to advance to the next

grade. Students that do not make the standard are required to attend a six-week summer school

program, after which they retake the exams. Those who pass move on to the next grade.

Students who again fail to meet the standard are required to repeat the grade, with the exception

of 15-year-olds who attend newly created “transition” centers.

One of the most striking features of Chicago’s social promotion policy was its scope.

Table 2.1 illustrates the number of students affected by the policy each year.2 The ITBS scores

are measured in terms of “grade equivalents” (GEs) that reflect the years and months of learning

a student has mastered. The exam is nationally normed so that a student at the 50th percentile in

the nation scores at the eighth month of her current grade – i.e., an average third grader will

score a 3.8. In the first full year of the policy, the promotion standards for third, sixth and eighth

grade were 2.8, 5.3, and 7.0 respectively, which roughly corresponded to the 20th percentile in

the national achievement distribution. The promotional criteria were raised for eighth graders in

1997-98 and 1998-99, and for all three grades in 1999-2000.
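The gate-grade cutoffs described above can be encoded directly. The following is an illustrative sketch, not CPS's actual decision system: the cutoff values (2.8, 5.3, 7.0) come from the text, while the function name and interface are hypothetical.

```python
# Hypothetical sketch: encode the 1996-97 ITBS promotion cutoffs (in grade
# equivalents) and check whether a student meets the standard in both subjects.
# Cutoff values are taken from the text; the function itself is illustrative.

PROMOTION_CUTOFFS_1997 = {3: 2.8, 6: 5.3, 8: 7.0}  # gate grade -> minimum GE

def meets_promotion_standard(grade, reading_ge, math_ge,
                             cutoffs=PROMOTION_CUTOFFS_1997):
    """Return True if the student clears the cutoff in BOTH reading and math.

    Students in non-gate grades are not subject to the promotional criteria.
    """
    cutoff = cutoffs.get(grade)
    if cutoff is None:          # not a gate grade
        return True
    return reading_ge >= cutoff and math_ge >= cutoff

# An average third grader at the national norm (3.8) easily clears the
# roughly-20th-percentile cutoff of 2.8; one scoring 2.5 in reading does not.
print(meets_promotion_standard(3, 3.8, 3.8))  # True
print(meets_promotion_standard(3, 2.5, 3.0))  # False
print(meets_promotion_standard(4, 1.0, 1.0))  # True (grade 4 is not a gate)
```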

Because many Chicago students are in special education and bilingual programs, and are

thereby exempt from standardized testing, only 70 to 80 percent of the students in the system


were directly affected by the accountability policies.3 Of those who were subject to the policy,

nearly 50 percent of third graders and roughly one-third of sixth and eighth graders failed to

meet the promotional criteria and were required to attend summer school in 1996-97. Of those

who failed to meet the promotional criteria in June, approximately two-thirds passed in August.

As a result, roughly 20 percent of third grade students and 10 to 15 percent of sixth and eighth

grade students were eventually held back in the Fall.
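The third-grade figures quoted above roughly reconcile as a simple product. The back-of-the-envelope check below uses the approximate rates from the text; actual retention decisions also involved waivers and summer-school no-shows, which this sketch ignores.

```python
# Back-of-the-envelope check of the 1996-97 third-grade retention figures.
# Rates are approximations from the text; the simple product below ignores
# waivers and students who never sat the August retest.

june_fail_rate = 0.50       # "nearly 50 percent of third graders" failed in June
august_pass_rate = 2 / 3    # "approximately two-thirds passed in August"

retained_share = june_fail_rate * (1 - august_pass_rate)
print(f"Implied share retained: {retained_share:.1%}")  # 16.7%
# The text reports roughly 20 percent retained, slightly above this lower
# bound, consistent with some June failers never passing (or never sitting)
# the August retest.
```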

In conjunction with the social promotion policy, the CPS instituted a policy designed to

hold teachers and schools accountable for student achievement. Under this policy, schools in

which fewer than 15 percent of students scored at or above national norms on the ITBS reading

exam are placed on probation. If they do not exhibit sufficient improvement, these schools may

be reconstituted, which involves the dismissal or reassignment of teachers and school

administrators. In 1996-97, 71 elementary schools serving over 45,000 students were placed on

academic probation.4

2.2 Prior Research on High-Stakes Testing

Studies of high school graduation tests provide mixed evidence on the effect of high-

stakes testing. While several studies found a positive association between student achievement

and minimum competency testing (Bishop 1998, Frederiksen 1994, Neill 1998, Winfield 1990),

a recent study with better controls for prior student achievement finds no effect (Jacob 2001).

There is also evidence that mandatory high school graduation exams increase the probability of

dropping out of school, particularly among low-achieving students (Catterall 1987, Kreitzer 1989, Catterall 1989, MacMillan 1990, Griffin 1996, Reardon 1996, Jacob 2001). Craig and Sheu (1992) found modest improvements in student achievement after the implementation of a school-based accountability policy in South Carolina in 1984, but Ladd (1999) found few achievement effects for a school-based accountability and incentive program in Dallas during the early 1990s.

2 Note that Fall enrollment figures increase beginning in 1997-98 because students who were retained in previous years are counted in the next Fall enrollment for the same grade.

3 Section 5 examines whether the number of students in bilingual and special education programs, or otherwise excluded from testing, has changed since the introduction of high-stakes testing.

The recent experience in Texas provides additional evidence regarding the impact of

high-stakes accountability systems. In the early 1990s, Texas implemented a two-tiered

accountability system, including a graduation requirement for high school as well as a system of

rewards and sanctions for schools based on student performance. Texas students made dramatic

gains on the Texas Assessment of Academic Skills (TAAS) from 1994 to 1998 (Haney 2000,

Klein 2000). In their 1998 study of state-level NAEP trends, Grissmer and Flanagan (1998)

found that Texas was one of the fastest improving states during the 1990s and that conditional on

family characteristics, Texas made the largest gains of all states. However, in an examination of

student achievement in Texas, Klein et al. (2000) found that TAAS gains were several times

larger than NAEP gains during the 1990s, and that over a four-year period the average NAEP

gains in Texas exceeded those of the nation in only one of the three comparisons—fourth grade

math, but not fourth grade reading or eighth grade math.5

Finally, there is some evidence that high-stakes testing leads teachers to neglect non-

tested subjects and less easily observable dimensions of student learning, leading to test-specific

achievement gains that are not broadly generalizable.

4 Probation schools received some additional resources and were more closely monitored by CPS staff. Jacob and Lefgren (2001) examined the impact of monitoring and resources provided to schools on probation, using a regression discontinuity design that compared the performance of students in schools that just made the probation cutoff with those that just missed the cutoff. They found that the additional resources and monitoring provided by probation had no impact on math or reading achievement.

5 It is difficult to compare TAAS and NAEP scores because the exams were given in different years. Fourth grade reading is the only grade-subject for which a direct comparison is possible since Texas students took both the TAAS and NAEP in 1994 and 1998. Fourth and eighth grade students took the NAEP math exams in 1992 and 1996, but did not start taking the TAAS until 1994.

Linn and Dunbar (1990) found that many states made smaller gains on the National Assessment of Educational Progress (NAEP) than their own achievement exams, presumably due at least in part to the larger incentives placed on the

state exams.6 Koretz et al. (1991) specifically examined the role of high-stakes incentives on

student achievement in two state testing programs in the 1980s. They found that scores dropped

sharply when a new form of test was introduced and then rose steadily over the next several

years as teachers and students became more familiar with the exam, suggesting considerable test-

specific preparation. Koretz and Barron (1998) found that gains on the state-administered exam

in Kentucky during the early 1990s were substantially larger than gains on the nationally-

administered NAEP exam and that the NAEP gains were roughly comparable to the national

average. Stecher and Barron (1999) documented that teachers in Kentucky allocated

instructional time differently across grades to match the emphasis of the testing program.

3. Empirical Strategy

Because the CPS instituted the accountability policy district-wide in 1996, it is difficult

to disentangle the effect of high-stakes testing from shifts in the composition of Chicago public

school students (e.g., an increase in middle-class students attending public schools in Chicago),

changes in state or federal education policy (e.g., the federal initiative to reduce class sizes) or

changes in the economy (e.g., reduced unemployment rates in the late 1990s). This section

describes several different sources of variation I exploit to identify the effect of the policy.

To determine the effect of high-stakes testing on student achievement, I estimate the

following education production function:

(3.1) y_isdt = δ(HighStakes_dt) + β_1 X_isdt + β_2 Z_sdt + γ_t + η_d + φ_dt + ε_isdt


6 Cannell (1987) noted that a disproportionate number of states and districts report being above the national norm on their own tests, dubbing this phenomenon the "Lake Wobegon" effect (Cannell 1987, Linn 1990, Shepard 1990, Shepard 1988).


where y is an achievement score for individual i in school s in district d at time t, X is a vector of

student characteristics, Z is a vector of school and district characteristics and ε is a stochastic

error term. Unobservable factors at the state and national level are captured by time (γ_t), district (η_d) and time*district (φ_dt) effects.

Using detailed administrative data for each student, I am able to control for observable

changes in student composition, including race, socio-economic status and prior achievement.

Because achievement data is available back to 1990, six years prior to the introduction of the

policy, I am also able to account for pre-existing achievement trends within the CPS. This short,

interrupted time-series design (Ashenfelter 1978) accounts for continuing improvement

stemming from earlier reform efforts.7 In the absence of any unobserved time or district*time

effects, this simple pre-post design will provide an unbiased estimate of the policy.
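As a numerical sketch of this pre-post logic (not the paper's actual estimation), one can fit a linear trend to pre-policy means and measure how far post-policy outcomes sit above the extrapolated trend. All values below are invented for illustration.

```python
# Illustrative interrupted time-series read of equation (3.1) on made-up
# district-mean scores: fit a linear trend to the pre-policy years, then
# measure the average deviation of post-policy years from that trend.

pre = {1993: 0.00, 1994: 0.02, 1995: 0.04}   # hypothetical pre-policy means
post = {1997: 0.15, 1998: 0.18}              # hypothetical post-policy means

# Closed-form OLS slope and intercept on the pre-policy period
n = len(pre)
xbar = sum(pre) / n
ybar = sum(pre.values()) / n
slope = sum((x - xbar) * (y - ybar) for x, y in pre.items()) / \
        sum((x - xbar) ** 2 for x in pre)
intercept = ybar - slope * xbar

# Average deviation of post-policy scores from the extrapolated pre trend
effect = sum(y - (intercept + slope * x) for x, y in post.items()) / len(post)
print(f"Pre-trend slope: {slope:.3f} per year; implied policy effect: {effect:.3f}")
```

If earlier reforms were still gaining traction, this trend extrapolation understates the counterfactual and the implied effect is biased accordingly, which is the point footnote 7 makes.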

In order to control for unobservable changes over time that affect student achievement,

one might utilize a difference-in-difference strategy which compares the achievement trends of

students who were likely to have been strongly influenced by the policy with the trends of

students who were less likely to be influenced by the policy. While no students were

exogenously excluded from the policy, it is likely that low-achieving students were more

strongly influenced by the policy than their peers, since they were disproportionately at-risk for

retention and were disproportionately located in schools at risk for probation. Thus, if we find

that after 1996 achievement levels of low-achieving students increased more sharply than high-

achieving students, we might conclude that the policy had a positive impact on achievement.

This model can be implemented by estimating the following specification

(3.2) y_isdt = δ_1(HighStakes_dt) + δ_2(Low_dt) + δ_3(HighStakes_dt * Low_dt) + β_1 X_isdt + β_2 Z_sdt + γ_t + η_d + φ_dt + ε_isdt

7 The inclusion of a linear trend implicitly assumes that any previous reforms that raised student performance would have continued with the same marginal effectiveness in the future. If this assumption is not true, the estimates may be biased downwards. In addition, this aggregate trend assumes that there are no school-level composition changes in Chicago. I test this assumption by including school-specific fixed effects and school-specific trends in certain specifications.


where Low is a binary variable that indicates a low prior achievement level.
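With fully saturated groups, the interaction coefficient in this specification reduces to the familiar double difference of group means. The sketch below illustrates that arithmetic with invented numbers; it is not an estimate from the paper's data.

```python
# Minimal difference-in-differences sketch of equation (3.2) using made-up
# group means. The interaction term equals the double difference:
# (low post - low pre) - (high post - high pre).

means = {  # (achievement group, period) -> hypothetical mean score
    ("low", "pre"): -0.50, ("low", "post"): -0.30,
    ("high", "pre"): 0.40, ("high", "post"): 0.45,
}

did = (means[("low", "post")] - means[("low", "pre")]) - \
      (means[("high", "post")] - means[("high", "pre")])
print(f"Extra post-policy gain for low achievers: {did:+.2f}")
```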

There are two major drawbacks to this strategy. First, it relies on the assumption that, in

the absence of the policy, unobserved time trends would have been equal across achievement

groups (i.e., γ_Low = γ_High = γ). Second, it assumes a homogeneous policy effect (i.e., δ_Low = δ_High = δ). In this case, the policy effect is likely to be a function of not only the incentives generated by the policy, but the capacity to respond to the incentives (i.e., δ = δ(a),

where a is a measure of prior achievement). It is likely that higher achieving students and

schools have a greater capacity to respond to the policy (i.e., more parental support, higher

quality teachers, more effective principal leadership, etc.), in which case this comparison will

underestimate the effect of the policy. Moreover, high-achieving students may differ from low-

achieving students in other ways that might be expected to influence achievement gains (e.g.,

risk preference). As we see in Section 6, it turns out that the policy effects are larger in low-

achieving schools, but there is no consistent pattern in policy effects across student prior

achievement groups, which suggests that incentives and capacity may have offset each other.

To assess the generalizability of achievement gains under the policy, I compare

achievement trends on the ITBS with performance trends on the Illinois Goals Assessment

Program (IGAP), a state-mandated achievement exam that is not used for accountability

purposes in Chicago. Because these exams are measured in different metrics, I standardize the

outcome measures using the mean and standard deviation from the base year.
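The standardization step can be sketched in a few lines. The scores below are invented; the point is only that each exam's later-year scores are expressed in standard-deviation units of its own base year, putting ITBS and IGAP trends on a common metric.

```python
# Sketch of the standardization described above: convert scores to z-scores
# using the mean and SD from the base year. All numbers are invented.

def standardize(scores, base_mean, base_sd):
    """Express scores in standard-deviation units of the base year."""
    return [(s - base_mean) / base_sd for s in scores]

base_year_scores = [160.0, 170.0, 180.0]   # hypothetical base-year scale scores
mu = sum(base_year_scores) / len(base_year_scores)
sd = (sum((s - mu) ** 2 for s in base_year_scores) / len(base_year_scores)) ** 0.5

later_year_scores = [175.0, 185.0]
print(standardize(later_year_scores, mu, sd))  # gains in base-year SD units
```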

Regardless of whether one uses the ITBS or IGAP as an outcome, the primary

disadvantage of comparing the achievement of Chicago students in different years is that it does

not take into account changes that were taking place outside the CPS during this period. It is

possible that state or national education reforms, or changes in the economy, were responsible


for the increases in student performance in Chicago during the late nineties. Indeed, Grissmer

and Flanagan (1998) show that student achievement increased in many states during the 1990s.

If this were the case, the estimates from equation 3.1 would overstate the impact of high-stakes

testing in Chicago.

Using a panel of IGAP achievement data on schools throughout Illinois during the 1990s,

I am able to address this concern. By comparing the change over time in Chicago with the

change over time in the rest of Illinois, one can eliminate any unobserved common time trends.

By comparing Chicago to other low-income, urban districts in Illinois, we eliminate any

unobserved trends operating within districts like Chicago. To implement this strategy, I estimate

the following model:

(3.3) y_sdt = δ_1(Post_t) + δ_2(Chicago_d) + δ_3(Urban_d) + δ_4(Chicago_d * Post_t) + δ_5(Urban_d * Post_t) + β_1 X_sdt + β_2 Z_dt + ε_sdt

where y is the average reading or math score for school s in district d at time t, Post is an

indicator that takes on the value of one after 1996 (when high-stakes testing was introduced in

Chicago) and zero otherwise, and Chicago and Urban are binary variables that indicate whether

the school is in Chicago or any low-income, urban district in Illinois. X captures school average

demographics and Z reflects district level demographic variables. This basic model can be

extended to account for pre-existing achievement trends in each district and to examine the

variation in policy effects across the distribution of school achievement (i.e., whether the

differential gain between low-achieving schools in Chicago and Illinois is larger than the

differential gains between high-achieving schools in Chicago and Illinois).
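The comparison behind equation (3.3) can be sketched with invented district-group means: Chicago's post-1996 change is benchmarked against other low-income urban districts after netting out the statewide change, i.e., the Chicago*Post coefficient net of the Urban*Post coefficient.

```python
# Hypothetical sketch of the equation (3.3) comparison. Each group maps to
# (pre-1996 mean, post-1996 mean) in standardized units; all values invented.

means = {
    "chicago": (-0.60, -0.35),
    "other_urban": (-0.55, -0.45),
    "rest_of_state": (0.10, 0.15),
}

gains = {g: post - pre for g, (pre, post) in means.items()}

# Each group's gain over and above the statewide change, then Chicago's
# extra gain relative to other low-income urban districts.
vs_state = {g: gains[g] - gains["rest_of_state"] for g in gains}
chicago_extra = vs_state["chicago"] - vs_state["other_urban"]
print(f"Chicago gain relative to other urban districts: {chicago_extra:+.2f} SD")
```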

4. Data

This study utilizes detailed administrative data from the Chicago Public Schools as well

as the Illinois State Board of Education. CPS student records include information on a student's


school, home address, demographic and family background characteristics, special education and

bilingual placement, free lunch status, standardized test scores, grade retention and summer

school attendance. CPS personnel and budget files provide information on the financial

resources and teacher characteristics in each school and school files provide aggregate

information on the school population, including daily attendance rates, student mobility rates and

racial and SES composition.

The measure of achievement used in Chicago is the Iowa Test of Basic Skills (ITBS), a

standardized, multiple-choice exam developed and published by the Riverside Company. The

exam is nationally normed and student scores are reported in a grade equivalent metric that

reflects the number of years and months of learning. A student at the 50th percentile in the nation

scores at the eighth month of her current grade – i.e., an average third grader will score a 3.8.

However, because grade equivalents present a number of well-known shortcomings for

comparisons over time or across grades,8 I use an alternative outcome metric derived from an

item-response model. By taking advantage of the common items across different forms and

levels of the exam, these measures provide an effective way to compare students in different

grade levels or taking different forms of the exam.9
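The item-response logic behind this metric (detailed in footnote 9) can be sketched as follows. The ability and difficulty values are invented; in practice both sets of parameters are estimated jointly from the response matrix, e.g., as coefficients in a logit with student and item indicators.

```python
import math

# Sketch of the Rasch/IRT model described in footnote 9: the probability that
# student i answers item j correctly is a logistic function of ability minus
# difficulty, both measured in logits. Parameter values here are invented.

def p_correct(ability, difficulty):
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A student whose ability equals the item's difficulty answers correctly half
# the time; an easier item raises that probability.
print(p_correct(ability=0.5, difficulty=0.5))             # 0.5
print(round(p_correct(ability=0.5, difficulty=-0.5), 3))  # 0.731
# Common items shared across forms and levels pin down the difficulty scale,
# which is what allows cross-form and cross-grade comparisons.
```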

The primary sample used in this analysis consists of students who were in grades three to

eight for the first time in 1993 to 1999 and who were tested and included for official reporting

8 There are three major difficulties in using grade equivalents (GEs) for over-time or cross-grade comparisons. First, different forms of the exam are administered each year. Because the forms are not correctly equated over time, one might confound changes in test performance over time with changes in form difficulty. Second, because the grade equivalents are not a linear metric, a score of 5.3 on level 12 of the exam does not represent the same thing as a score of 5.3 at level 13. Thus, it is difficult to accurately compare the ability of two students if they are in different grades. Finally, the GE metric is not linear within test level because the scale spreads out more at the extremes of the score distribution. For example, one additional correct response at the top or bottom of the scale can translate into a gain of nearly one full GE, whereas an additional correct answer in the middle of the scale would result in only a fraction of this increase.

9 Models based on Item Response Theory (IRT) assume that the probability that student i answers question j correctly is a function of the student's ability and the item's difficulty. In practice, one estimates a simple logit model in which the outcome is whether or not student i correctly answers question j. The explanatory variables include an indicator variable for each question and each student. The difficulty of the question is given by the coefficient on the appropriate indicator variable and the student's ability is measured by the coefficient on the student indicator variable. The resulting metric is calibrated in terms of logits. This metric can be used for cross-grade comparisons as long as the exams in question share a number of common questions (Wright 1979). I thank the Consortium on Chicago School Research for providing the Rasch measures used in this analysis.


purposes.10 I limit my analysis to first-time students because the implementation of the social

promotion policy caused a large number of low-performing students in third, sixth and eighth

grade to be retained, which substantially changed the student composition in these and

subsequent grades beginning in 1997-98.11 By limiting the analysis to cohorts beginning in

1993, I am able to control for at least three years of prior achievement scores for all students.12 I

do not examine cohorts after 1999 because the introduction of the social promotion policy substantially

changed the composition of these groups. For example, nearly 20 percent of the original 2000

sixth grade cohort was retained in 1997 when they experienced high-stakes testing as third

graders.13

Table 4.1 presents summary statistics for the sample. Like many urban districts across

the country, Chicago has a large population of minority and low-income students. In the sixth

grade group, for example, roughly 56 percent of students are Black, 30 percent are Hispanic and

over 70 percent receive free or reduced price lunch. The percent of students in special education

and bilingual programs is substantially lower in this sample than for the entire CPS population

because many students in these programs are not tested, and are thus not shown here. We see

that a relatively small number of students in these grades transfer to a private school (one

percent), move out of Chicago (five percent) or drop out of school (one percent). The rates

10 As explained in greater detail in Section 5, certain bilingual and special education students are not tested or, if tested, not included for official reporting purposes.

11 While focusing on first-timers allows a consistent comparison across time, it is still possible that the composition changes generated by the social promotion policy could have affected the performance of students in later cohorts. For example, if first-timers in the 1998 and 1999 cohorts were in classes with a large number of low-achieving students who had been retained in the previous year (in addition to the low-achieving students in the grade for the first time), they might perform lower than otherwise expected. This would bias the estimates downward.

12 Because a new series of ITBS forms was introduced in 1993, focusing on these cohorts also permits the cleanest comparisons of changes over time. For students missing prior test score information, I set the missing test score to zero and include a variable indicating the score for this period was missing.


13 For this reason, the eighth grade sample only includes cohorts from 1993 to 1998. The 1999-2000 cohort is not included in the third grade sample because there is evidence that retention rates in first and second grade increased substantially after the introduction of high-stakes testing even though these grades were not directly affected by the social promotion policy.


among eighth graders are substantially higher than the other elementary grades, though not

particularly large in comparison to high school rates.

5. The Impact of High-Stakes Testing on Testing Patterns

While the new accountability policies in Chicago are designed to increase student

achievement, high-stakes testing creates incentives for students and teachers that may change

test-taking patterns. For example, if students are required to pass the ITBS in order to move to

the next grade, they may be less inclined to miss the exam. On the other hand, under the

probation policy teachers have an incentive to dissuade low-achieving students from taking the

exam.14 In a recent descriptive analysis of testing patterns in Chicago, Easton et al. (2000, 2001)

found that the percent of CPS students who are tested and included15 for reporting purposes has

declined during the 1990s, particularly following the introduction of high-stakes testing.

However, they attribute this decline to changing demographics (specifically, an increase in

bilingual students in the system) and administrative changes in the testing policy[16] rather than to

the new high-stakes testing program. Shifts in testing patterns are important not only because

they may indicate if some students are being neglected under the policy, but also because the

[14] Schools are not explicitly judged on the percentage of their students who take the exams, although it is likely that a school with an unusually high fraction of students who miss the exam would fall under suspicion.

[15] Students in Chicago fall into one of three testing categories: (1) tested and included, (2) tested and excluded, and (3) not tested. Included versus excluded refers to whether the student's test scores count for official reporting purposes. Students who are excluded from reporting are not subject to the social promotion policy, and their scores do not contribute to the determination of their school's probation status. The majority of students in grades two through eight take the ITBS, but roughly ten percent of these students are excluded from reporting on the basis of participation in special education or bilingual programs. A third group of students is not required to take the ITBS at all because of a bilingual or special education classification. Students with the least severe special education placements and those who have been in bilingual programs for more than four years are generally tested and included. Those with more severe disabilities and those who have been in bilingual programs three to four years are often tested but excluded. Finally, students with severe learning disabilities and those who have been in bilingual programs for only one or two years are generally not tested.

[16] Prior to 1997, the ITBS scores of all bilingual students who took the standardized exams were included for official reporting purposes. During this time, CPS testing policy required students enrolled in bilingual programs for more than three years to take the ITBS, but teachers were given the option to test other bilingual students. According to school officials, many teachers were reluctant to test bilingual students, fearing that their low scores would reflect poorly on the school. Beginning in 1997, CPS began excluding the ITBS scores of students who had been enrolled in bilingual programs for three or fewer years to encourage teachers to test these students for


changing composition of the tested population may bias estimates of the achievement gains under the policy. Building on the work of Easton

and his colleagues, this analysis not only controls for changes in observable student

characteristics over time and pre-existing trends within the CPS, but also separately considers

the trends for gate versus non-gate grades and low versus high achieving students.

5.1 Trends in Testing and Exclusion Rates in Gate Grades

Table 5.1 presents OLS estimates of the relationship between the probability of (a)

missing the ITBS and (b) being excluded from reporting conditional on taking the ITBS.[17]

Controls include prior achievement, race, gender, race*gender interactions, neighborhood

poverty, age, household composition, a special education indicator and a linear time trend.

Robust standard errors that account for the correlation of errors within school are shown in

parentheses. Estimates for Black students are shown separately because these students were

unlikely to have been affected by the changes in the bilingual policy.
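
For concreteness, the specification just described can be sketched as a linear probability model with standard errors clustered by school. The sketch below is illustrative only: the data are simulated and the variable names (high_stakes, school_id, and so on) are invented, not the paper's actual fields.

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools, per_school = 50, 40
n = n_schools * per_school
school_id = np.repeat(np.arange(n_schools), per_school)

high_stakes = rng.integers(0, 2, n)     # policy-era cohort indicator
prior_score = rng.normal(0, 1, n)       # prior achievement control
trend = rng.uniform(0, 1, n)            # stand-in for the linear time trend

# Linear probability of missing the ITBS; the policy lowers it by ~3 points.
y = (rng.uniform(size=n) < 0.06 - 0.03 * high_stakes).astype(float)

X = np.column_stack([np.ones(n), high_stakes, prior_score, trend])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Cluster-robust covariance: (X'X)^-1 [sum_g X_g'e_g e_g'X_g] (X'X)^-1
XtX_inv = np.linalg.inv(X.T @ X)
meat = np.zeros((X.shape[1], X.shape[1]))
for g in range(n_schools):
    Xg, eg = X[school_id == g], resid[school_id == g]
    s = Xg.T @ eg
    meat += np.outer(s, s)
cluster_se = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("policy effect:", beta[1], "clustered SE:", cluster_se[1])
```

The clustering step is what "robust standard errors that account for the correlation of errors within school" refers to: residuals are summed within each school before forming the middle matrix of the sandwich estimator.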

The results suggest that high-stakes testing increased test-taking rates, particularly among

eighth grade students. For Black eighth grade students, the policy appears to have decreased the

probability of missing the ITBS by between 2.8 and 5.7 percentage points, a decline of 50 to 100

percent given the baseline rate of 5.7 percent. These effects were largest among high-risk

students and schools (tables available from author upon request). In contrast, high-stakes testing

appears to have had little if any impact on the probability of being excluded from test reporting.

The probability of being excluded did increase slightly for Black students in eighth grade under

high-stakes testing, although this may be biased upward due to the increase in test-takers.[18]

There were no effects for third and sixth grade students.

[16, cont.] diagnostic purposes. In 1999, the CPS began excluding the scores of fourth-year bilingual students as well, but also began requiring third-year bilingual students to take the ITBS exams.

[17] Probit estimates yield virtually identical results, so linear probability estimates are presented for ease of interpretation.


[18] If the students induced to take the ITBS under the high-stakes regime were more likely to be excluded, the exclusion estimates would be biased upward. To control for possible unobservable characteristics that are


5.2 Trends in Testing and Exclusion Rates in Non-Gate Grades

While the social promotion policy only directly affected students in third, sixth and

eighth grades, the probation policy may have affected students in all grades since schools are

judged on the basis of student ITBS scores in grades three through eight. Because ITBS scores

do not have direct consequences for students in non-gate grades, they provide a cleaner test of

behavioral responses on the part of teachers. Probation provides teachers an incentive to exclude

low-achieving students and to encourage high-achieving students to take the exam.

While the probation policy may have influenced teacher behavior in each year after 1996,

it may be difficult to identify the effects after the first year of the policy because the introduction

of the social promotion policy in grades three and six in 1997 influenced the composition of

students in the fourth, fifth, and seventh grades beginning in the 1997-98 school year. To avoid

these selection problems, I limit the analysis to the first cohort to experience the probation

policy. Table 5.3 presents OLS estimates of the probability of (a) missing the ITBS and (b) being excluded from reporting for

Black students in fourth, fifth and seventh grade in 1997. The average effect shown in the top

row suggests that the accountability policy decreased the likelihood of missing the ITBS, but did

not significantly increase the probability of being excluded (with the exception of fifth grade,

where the effect was statistically significant, but substantively small). The second panel shows

that the largest decreases in missing the ITBS took place in low achieving schools. In the third

panel, we see that the likelihood of being excluded increased for low-achieving students but

decreased for high-achieving students.


[18, cont.] correlated with exclusion and the probability of taking the exam under the high-stakes testing regime, I include five variables indicating whether the student was excluded from test reporting in each of the previous five years that he/she took the exam. To further test the sensitivity of the exclusion estimates to changes in the likelihood of taking the ITBS, I estimate each model under the alternative assumptions that exclusion rates among those not tested are equal to, twice as high as, and 50 percent as high as the rate among those tested in the same school prior to high-stakes testing. The results are robust to these assumptions.


6. The Impact of High-Stakes Testing on Math and Reading Achievement

Having examined the effect of high-stakes testing on test-taking and exclusion patterns,

we now turn to the effect on math and reading achievement. Figures 6.1 and 6.2 show the trends

in ITBS math and reading scores for third, sixth and eighth grades from 1990 to 2000. Because

the social promotion policy was instituted for eighth graders one year before it was instituted for

third and sixth graders, I have rescaled the horizontal axis to reflect the years before or after the

introduction of high-stakes testing. Recall that the sample includes only students who were in

third, sixth or eighth grade for the first time (i.e., no retained students). Also note that the trend

for eighth grade students ends in 1998 because the composition of eighth grade cohorts changed

substantially in 1999 when the first wave of sixth graders to experience the grade retentions in

1997 reached the eighth grade. The achievement trends rise sharply after the introduction of

high-stakes testing for both math (Figure 6.1) and reading (Figure 6.2). In mathematics,

achievement increased slightly in the six years prior to the policy, but rose dramatically

following the implementation of the accountability policy. Achievement levels are roughly 0.25

logits higher under high-stakes. Given a baseline annual gain of 0.50 logits, achievement levels

have risen by the equivalent of one-half of a year’s learning after implementation of high-stakes

testing. The slope of the line is steeper under the new regime because later cohorts were exposed

to the accountability policy for a longer period before taking the exam and/or students and

teachers may have become more efficient at responding to the policy over time.[19]

In estimating the impact of the high-stakes testing policy on student achievement, it is

important to control for student demographics and prior achievement scores since there is

evidence that the composition of the Chicago student population may have been changing over

this period. Controlling for prior achievement does not present a problem in estimating the

policy effects for the 1997 cohort since this group only experienced the heightened incentives in


the 1996-97 school year. However, as already noted, later cohorts experienced the incentives

generated by the accountability policies prior to entering the gate grade. For example, the 1999

sixth grade cohort attended fourth and fifth grades (1997 and 1998) under the new regime. To

the extent that the probation policy influenced student achievement in non-gate grades, or

students, parents or teachers changed behavior in non-gate grades in anticipation of the new

promotional standards in the gate-grades, one would expect student test scores to be higher in

these non-gate grades in comparison to the counterfactual no-policy state.

This presents two potential problems in the estimation. First, these pre-test scores would

be endogenous in the sense that they were caused by the policy. If the policy increased

achievement in these grades, controlling for these prior test scores would decrease the estimated

effect of the policy.[20] Second, the inclusion of these endogenous pre-test scores will influence

the estimation of the coefficients on the prior achievement (and thus all other) variables. For this

reason, Table 6.1 shows OLS estimates with various prior achievement controls.[21] The first

column for each subject includes controls for three prior years of test score information (t-1, t-2

and t-3). The second column for each subject controls for achievement scores at time t-2 and t-3.

The third column for each subject controls only for the test score at t-3.
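
The endogeneity concern can be made concrete with a small simulation: when the policy also raises the t-1 pretest, controlling for that pretest absorbs part of the true effect, while specifications using only pre-policy scores do not. Everything below is invented for illustration and mimics, rather than reproduces, the paper's specifications.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
policy = rng.integers(0, 2, n).astype(float)
ability = rng.normal(0, 1, n)

score_t3 = ability + rng.normal(0, 0.5, n)                  # pre-policy year
score_t2 = ability + rng.normal(0, 0.5, n)                  # pre-policy year
score_t1 = ability + 0.30 * policy + rng.normal(0, 0.5, n)  # policy raised t-1
outcome = ability + 0.25 * policy + rng.normal(0, 0.5, n)   # gate-year score

def policy_effect(controls):
    """OLS coefficient on the policy dummy given a list of pretest controls."""
    X = np.column_stack([np.ones(n), policy] + controls)
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta[1]

spec1 = policy_effect([score_t1, score_t2, score_t3])  # controls t-1, t-2, t-3
spec2 = policy_effect([score_t2, score_t3])            # controls t-2, t-3
spec3 = policy_effect([score_t3])                      # controls t-3 only

# spec1 is attenuated: the endogenous t-1 score soaks up part of the effect.
print(spec1, spec2, spec3)
```

This is why the table reports estimates under all three control sets: the gap between the first and the other two columns bounds the importance of the endogenous pretest.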

The preferred estimates (in the shaded cells) suggest that high-stakes testing has

significantly increased student achievement in nearly all subjects, grades and years. For example,

consider the results for eighth grade reading. The estimated effect for the 1996 cohort is 0.132,

meaning that eighth graders in 1995-96 scored 0.132 logits higher than observationally similar

eighth graders in earlier years. To get a sense of the magnitude of this effect, the bottom of the

[19] With these data, we are not able to distinguish between an implementation effect and longer exposure to the treatment.

[20] Alternatively, these estimates could simply be interpreted as the impact of the policy in the gate year alone.


[21] To the extent that high-stakes testing has significantly changed the educational production function, we might expect the coefficient estimates to differ significantly for cohorts in the high-stakes testing regime. In this case, one might be concerned that including the post-policy cohorts in the estimation might change the estimated policy effect. To check this, I first estimated a model that included only pre-policy cohorts. I then used the coefficient estimates to calculate the policy effect for post-policy cohorts. The results were virtually identical.


table presents the average gain for third, sixth and eighth graders in 1993 to 1995 (with the

standard deviation in parenthesis). Given the average annual learning gain of roughly 0.50,

high-stakes testing increased the reading achievement of eighth graders in 1996 by more than 25

percent of the annual learning gain. The impact on math achievement was roughly 20 percent of

the annual learning gain.[22]
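
The unit conversion behind these magnitude statements is simple: divide a logit-scale effect by the baseline annual gain of roughly 0.50 logits reported above. A two-line sketch, using the figures from the text:

```python
# Baseline annual learning gain on the Rasch (logit) scale, as reported above.
ANNUAL_GAIN_LOGITS = 0.50

def share_of_annual_gain(effect_logits):
    """Express a logit-scale effect as a fraction of a year's learning."""
    return effect_logits / ANNUAL_GAIN_LOGITS

# The 0.25-logit aggregate rise is half a year's learning; the 0.132-logit
# eighth grade reading effect is roughly a quarter of a year.
print(share_of_annual_gain(0.25), share_of_annual_gain(0.132))
```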

The one exception occurs in third grade reading in 1997 and 1999. Insofar as first and second grade teachers have not adjusted instruction since the introduction of high-stakes testing (perhaps a reasonable assumption given that these grades are not counted in determining a school's probation status), it is not problematic to include controls for first and second grade achievement. As I suggest in the next section, the exception may instead be due to form effects.

6.2 The Sensitivity of High-Stakes Testing Effects

To test the sensitivity of the findings presented in the previous section, Tables 6.2 and

6.3 present comparable estimates for a variety of different specifications and samples. Column 1

shows the baseline estimates for all students who took the test and whose scores were included

for test reporting. Columns 2 to 7 show that the results are robust to using excluded as well as

included students and to a variety of assumptions regarding the hypothetical scores of students

who missed the exam. Columns 8 and 9 suggest that the original estimates are robust to the

inclusion of a single linear trend as well as school-specific trends and school-specific intercepts.

A potentially more serious concern involves the fact that CPS administered several

different forms of the ITBS exam during the 1990s. While the exams are generally comparable,

there is evidence that some forms are slightly easier than others, particularly for certain grades

and subjects and at certain points in the ability distribution. This can be seen in the achievement

[22] This table also provides a sensitivity check for the model. The results to the right of each shaded cell within each panel are alternative estimates that use different prior achievement controls. In eighth grade reading, for example, the specification that controls for three years of prior achievement data yields an estimate for the 1996 cohort of .132. Specifications that include only two or one year of prior test score information yield estimates of .133 and .114


trends shown in Figures 6.1 and 6.2. Some of the year-to-year swings that appear too large to be

explained by a change in the student population or other events taking place in the school system

are likely due to form effects. To the extent that the timing of different forms coincides with the

introduction of the accountability policy, our estimates may be biased. For example, in Table

6.1 the baseline estimates for 1997 and 1999 reading effects are noticeably smaller than the 1998

effect in third and sixth grade. It seems unlikely that any change in the policy could have been

responsible for this pattern, particularly because it is not increasing or decreasing monotonically.

This suggests that the reading section of Form M (the form given in 1997 and 1999) was more

difficult than the comparable section in earlier forms. One way to test the sensitivity of the

findings to potential form effects is to compare achievement in years in which the same form was

administered. Fortunately, Chicago students took ITBS Form L in 1994, 1996 and

1998.[23] Column 10 presents estimates from a sample limited to these three cohorts, which

suggest that form effects are not driving the results presented earlier.
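
The same-form restriction amounts to a simple sample filter over the year-to-form mapping given in the text and footnote 23. A minimal sketch:

```python
# Year-to-form mapping as described in the text and footnote 23.
itbs_form = {1993: "K", 1994: "L", 1995: "K", 1996: "L",
             1997: "M", 1998: "L", 1999: "M", 2000: "K"}

# Keep only cohorts tested with Form L, which spans both the low-stakes
# (1994) and high-stakes (1996, 1998) regimes.
form_l_years = sorted(y for y, f in itbs_form.items() if f == "L")
print(form_l_years)  # [1994, 1996, 1998]
```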

6.3 The Heterogeneity of Effects Across Student and School Risk Level

Having examined the aggregate achievement trends in the CPS, we now examine the

heterogeneity of gains across students and schools. Figures 6.3 and 6.4 show ITBS math and

reading trends at different points on the distribution for third, sixth and eighth grade students.

Achievement increased by roughly the same absolute amount at the different points on the

distribution, suggesting that there was not a significant difference between low and high-

achieving students. Table 6.4 shows OLS estimates of the differential effects across students and

schools. Because the pattern of effects is similar across cohorts, for the sake of simplicity I

[22, cont.] respectively. Reading across the rows, it appears that the sixth and eighth grade estimates are not too sensitive to the prior achievement controls.


[23] This is the only form that was given under both the low-stakes and high-stakes regimes. Form K was administered in 1993 and 1995. Form M was administered in 1997 and 1999. While Chicago students took Form K again in 1999-2000, it is difficult to compare this cohort to earlier cohorts because of composition changes caused by the retention of low-achieving students in earlier cohorts.


combine all three cohorts that experienced high-stakes testing and control for all prior

achievement scores, which means that these estimates only capture the effect of high-stakes

testing during the gate grade.

The first row shows that the average effects for all students and schools are positive and

significant. Estimates in the middle panel show that students in low-achieving schools made

larger gains under high-stakes testing than their peers in higher-achieving schools. The bottom

panel shows that the relationship between prior student achievement and the policy effect varies

considerably by subject and grade. Overall, there is not strong evidence that low-achieving

students do better under the policy, perhaps because these students did not have the capacity to

respond to the incentives as well as their peers. Because it appears that the accountability policy

influenced students and schools across the ability distribution, difference-in-differences estimates comparing these groups will not recover the impact of the policy.

6.4 The Impact of High-Stakes Testing in Non-Gate Grades

While the social promotion policy clearly focuses on the three gate grades, it is useful to

examine achievement trends in the non-gate grades as well. Because the probation policy

provides incentives for teachers in non-gate as well as gate grades, the achievement trends in

non-gate grades provide a better measure of the independent incentive effect of the probation

policy.[24] As noted earlier, this analysis focuses on the 1996-97 cohort because the social

promotion policy changed the composition of students in later years. Figures 6.5 and 6.6 show

math and reading achievement trends for students in grades four, five and seven who took the

ITBS and were included in test reporting. Math achievement (Figure 6.5) appears to have

continued a steady upward trend that began prior to the introduction of high-stakes testing. In


[24] Teachers and students in these grades may be anticipating and responding to the promotional standards in the next grade as well. For example, interviews with students and teachers indicate that many students worry about the cutoffs in earlier years and that many teachers use the upcoming next-year exams as motivation and classroom management techniques.


contrast, reading achievement (Figure 6.6) jumped sharply in 1997 in both fifth and seventh

grades, but there does not appear to be a significant increase in fourth grade. This is consistent

with the fact that the school accountability policy focuses mainly on reading.

Table 6.5 shows OLS estimates of the policy effects in fourth, fifth and seventh grade.

The aggregate estimates in the top row tell the same story as the graphs—there are significant

positive effects in all subjects and grades except for fourth grade reading. The results in the

middle panel suggest that students in lower achieving schools made larger improvements under

high-stakes testing, although the pattern of effects across student prior achievement levels is

more mixed. An analysis of policy effects within student*school achievement cells shows that in

the lowest achieving schools, students above the 25th percentile showed the largest increases in

reading achievement, again consistent with the focus of the accountability policy.


7. How Generalizable Were the Test Score Gains in Chicago?

In the previous section, we saw that ITBS math and reading scores in Chicago increased

dramatically following the introduction of high-stakes testing. This section examines whether

the observed ITBS gains reflect a more general increase in student achievement by comparing

student performance trends on the ITBS with comparable trends on the Illinois Goals

Assessment Program (IGAP), a state-mandated achievement exam that is not used for

accountability purposes in Chicago. The data for this analysis are drawn from school “report

cards” compiled by the Illinois State Board of Education (ISBE), which provide average IGAP

scores by grade and subject as well as background information on schools and districts. The

analysis is limited to the period from 1990 to 1998 because Illinois introduced a new exam in

1999.

7.1 A Comparison of ITBS and IGAP Achievement Trends in Chicago Over Time[25]

Figure 7.1 shows the achievement trends on the math IGAP exam for all students in

Chicago from 1990 to 1998. IGAP math scores increased steadily in all grades over this period,

but no significant discontinuity is evident in 1997.[26] Figures 7.2 to 7.4 show both IGAP and

ITBS trends for first-time students in grades three, six and eight whose scores were included in

ITBS reporting. Scores are standardized using the 1993 means and standard deviations. In third

and sixth grade, both the ITBS and IGAP trend upward over this period, although they increase

most steeply at different points. IGAP scores are increasing most sharply from 1994 to 1996

while ITBS scores rise sharply from 1996 to 1998. In the eighth grade, the ITBS and IGAP

trends track each other more closely, although it still appears that ITBS gains outpaced IGAP

scores two and three years after the introduction of high-stakes testing.

[25] This analysis is limited to math because equating problems make it difficult to compare IGAP reading scores over time (Pearson 1998).

[26] In 1993, Illinois slightly changed the scaling of the exam, which accounts for the change from 1992 to 1993.


Table 7.1 shows OLS estimates of the ITBS and IGAP effects. The sample is limited to

the period from 1994 to 1998 because student-level IGAP data were not available before this time.

With no control for pre-existing trends in columns 1 and 5, it appears that math achievement has

increased by a similar magnitude on the ITBS and IGAP. However, when we account for pre-

existing achievement trends in columns 2 and 6, ITBS achievement scores appear to have

increased significantly more than IGAP scores among third and sixth grade students. In eighth

grade, however, the ITBS and IGAP remain comparable once we control for pre-existing trends.

(It is important to note that in the eighth grade model, the prior achievement trend is only based

on two data points – 1994 and 1995). Thus, in the third and sixth grades it appears that the

positive impact of high-stakes testing is not generalizable beyond the high-stakes exam. On the

other hand, in eighth grade, the high-stakes testing effects appear more consistent across the

ITBS and IGAP.
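
The role of the pre-existing-trend control in columns 2 and 6 can be illustrated with a small simulation: when scores are already trending upward, a simple post-policy dummy conflates the trend with the policy, while adding a linear cohort trend isolates the jump. All values below are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
years = np.arange(1994, 1999)
per_year = 400
year = np.repeat(years, per_year)
post = (year >= 1997).astype(float)  # high-stakes era
t = (year - 1994).astype(float)      # linear cohort trend

# True process: a 0.05/yr secular trend plus a 0.10 policy jump in 1997.
y = 0.05 * t + 0.10 * post + rng.normal(0, 0.1, year.size)

def coef(X, idx):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[idx]

naive = coef(np.column_stack([np.ones_like(t), post]), 1)         # no trend
detrended = coef(np.column_stack([np.ones_like(t), t, post]), 2)  # with trend

# The naive post dummy absorbs the secular trend; the detrended one does not.
print(naive, detrended)
```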


7.2 A Comparison of IGAP Trends Across School Districts in Illinois

Having examined the ITBS and IGAP math trends in Chicago during the 1990s, we now

compare achievement trends in Chicago to those in other districts in Illinois. This cross-district

comparison not only allows us to examine reading as well as math scores (since the scoring

changes in reading applied to all districts), but also allows us to control for unobservable factors

operating in Illinois. Table 7.2 presents summary statistics of the sample in 1990.27 Columns 1

and 2 show the statistics for Chicago and non-urban districts in Illinois. Column 3 shows the

statistics for urban districts in Illinois excluding Chicago.28 The City of Chicago serves a

significantly more disadvantaged population than the average district in the state, surpassing

even other urban districts in terms of the percent of students receiving free lunch and the percent

of Black and Hispanic students in the district.

Figures 7.5 and 7.6 show changes over time in the average difference in achievement

scores between Chicago and other urban districts. While it appears that Chicago has narrowed

the achievement gap with other urban districts in Illinois, the trend appears to have begun prior

to the introduction of high-stakes testing. One possible exception is in eighth grade reading,

where it appears that Chicago students may have improved more rapidly than students in other

urban districts following the introduction of high-stakes testing. Tables 7.3 and 7.4 show OLS

estimates from the specifications in equation 3.3 that control for racial composition of the school,

the percent of students receiving free or reduced price lunch, the percent of Limited English

Proficient (LEP) students, school mobility rates, per-pupil expenditures in the district and the

percent of teachers with at least a Masters degree in the district. The sample used in the

[27] I delete roughly one-half of one percent of observations that were missing test score or background information.

[28] To identify the comparison districts, I first identify districts in the top decile in terms of the percent of students receiving free or reduced price lunch, percent minority students, and total enrollment, and in the bottom decile in terms of average student achievement (averaged over third, sixth and eighth grade reading and math scores), based on 1990 data. Not surprisingly, Chicago falls at the extreme of all four categories. Of the 840 elementary districts in 1990, Chicago ranks first in terms of enrollment, 12th in terms of percent of low-income and minority students, and 830th in student achievement. Other districts that appear at the extreme of all categories include East St. Louis, Chicago Heights, East Chicago Heights, Calumet, Joliet, Peoria and Aurora. I then use the 34 districts (excluding


regressions only included information from 1993 to 1998 in order to minimize any confounding

effects from the change in scoring. Recall that Chicago is coded as an urban district in these

models, so that the coefficient on the Chicago indicator variable reflects the difference between

Chicago and all other urban comparison districts. Once we take into account the pre-existing

trends (columns 2, 4 and 6), there is no significant difference between Chicago and the urban

comparison districts in either reading or mathematics. This is true for schools across the

achievement distribution (figures and tables available upon request).
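
The logic of this comparison can be sketched in a few lines: compute the Chicago-versus-comparison gap by year and check whether it narrows faster after the policy than before. The district averages below are invented for illustration; with these numbers the gap closes at the same rate in both periods, which is exactly the pattern a pre-existing-trend control absorbs.

```python
# Hypothetical district-average scores by year (IGAP-style scale).
chicago = {1993: 230, 1994: 233, 1995: 237, 1996: 241, 1997: 245, 1998: 249}
comparison = {1993: 245, 1994: 246, 1995: 248, 1996: 250, 1997: 252, 1998: 254}

gap = {y: comparison[y] - chicago[y] for y in chicago}
pre_rate = (gap[1993] - gap[1996]) / 3   # points of narrowing per year, pre-policy
post_rate = (gap[1996] - gap[1998]) / 2  # points of narrowing per year, post-policy

# Identical narrowing rates: the "catch-up" predates the 1996-97 policy.
print(gap, pre_rate, post_rate)
```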

8. Explaining the Differential Gains on the ITBS and the IGAP

There are a variety of reasons why achievement gains could differ across exams under

high-stakes testing, including differential student effort, ITBS-specific test preparation and

cheating.[29] Unfortunately, because many of these factors, such as effort and test preparation, are

not directly observable, it is difficult to attribute a specific portion of the gains to a particular

cause. Instead we now seek to provide some evidence regarding the potential influence of

several factors.

8.1 The Role of Guessing

Prior to the introduction of high-stakes testing, roughly 20 to 30 percent of students left

items blank on the ITBS exams despite the fact that there was no penalty for guessing. One

explanation for the relatively large ITBS gains since 1996 is that the accountability policy

encouraged students to guess more often rather than leaving items blank. If we believe that the

increased test scores were due solely to guessing, we might expect the percent of questions

answered to increase, but the percent of questions answered correctly (as a percent of all

[28, cont.] Chicago) that fall into the bottom decile in at least three out of four of the categories. I have experimented with several different inclusion criteria and the results are not sensitive to the choice of the urban comparison group.


answered questions) to remain constant or perhaps even decline. However, from 1994 to 1998,

the percent of questions answered correctly increased by roughly 9 percent at the same time that

the percent left blank declined by 36 percent, suggesting that the higher completion rates were

not due entirely to guessing. Even if we were to assume that the increase in item completion is

due entirely to random guessing, however, Table 8.1 shows that guessing could only explain 10

to 20 percent of the observed ITBS gains among those students who were most likely to have left

items blank in the past.
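
The bounding logic is straightforward: each newly attempted item is worth 1/k points in expectation under random guessing, where k is the number of answer choices. A worked sketch, with an illustrative 40-item, four-choice exam and an assumed 10 percent pre-policy blank rate (the 36 percent decline is the figure reported above):

```python
# Illustrative inputs: a 40-item, four-choice exam and a pre-policy blank
# rate of 10 percent that falls by 36 percent, as reported for 1994-1998.
n_items, choices = 40, 4
blank_before = 0.10
blank_after = blank_before * (1 - 0.36)

newly_attempted = (blank_before - blank_after) * n_items
expected_extra_correct = newly_attempted / choices  # 1/k per guessed item

print(newly_attempted, expected_extra_correct)
```

Under these assumptions, guessing on the newly attempted items adds well under one correct answer per student, consistent with the 10 to 20 percent bound in Table 8.1.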

8.2 The Role of Test Preparation

Another explanation for the differential effects involves the content of the exams. If the

ITBS and IGAP emphasize different topics or skills, and teachers have aligned their curriculum

to the ITBS in response to the accountability policy, then we might see disproportionately large

increases on the ITBS. As we can see in Table 8.2, while the two exams have the same general

format, the IGAP appears to place greater emphasis on critical thinking and problem-solving

skills. For example, the IGAP math exam has fewer straight computation questions, and even

these questions are asked in the context of a sentence or word problem. Similarly, with its long

passages, multiple correct answers and questions comparing passages, the IGAP reading exam

appears to be more difficult and more heavily weighted toward critical thinking skills than the

ITBS exam.

To the extent that the disproportionately large ITBS gains were driven by ITBS-specific

curriculum alignment or test preparation, we might expect to see the largest gains on the ITBS

items that are (a) easy to teach to and (b) relatively more common on the ITBS than the IGAP.

Table 8.3 presents OLS estimates of the relationship between high-stakes testing and ITBS math


[29] Jacob and Levitt (2001) found that instances of classroom cheating increased substantially following the introduction of high-stakes testing. However, they estimate that cheating could explain at most 10 percent of the observed ITBS gains during this period.


achievement by item type.30 The sample consists of tested and included students in the 1996 and

1998 cohorts, both of which took ITBS Form L. The dependent variable is the proportion of

students who answered the item correctly in the particular year. We see that students made the

largest gains on items involving computation and number concepts and made the smallest gains

on problems involving estimation, data analysis and problem-solving skills.

Using the coefficients in Table 8.3, we can adjust for the composition differences

between the ITBS and IGAP exam. Suppose that under high-stakes testing students were 2.3

percentage points more likely to correctly answer IGAP math items involving critical thinking or

problem solving skills and 4.2 percentage points more likely to answer IGAP math items

requiring basic skills. Scores on the IGAP math exam range from 1 to 500. Assuming a linear

relationship between the proportion correct and the score, a one percentage point increase in the

proportion correct translates into a five point gain (which, given the 1994 standard deviation of

roughly 84 points, translates into a gain of 0.06 standard deviations). If the distribution of

question types on the IGAP resembled that on the ITBS, with one-third of the items devoted to

basic skill material, then IGAP scores would have been predicted to increase by an additional

two points.31 Given the fact that IGAP scores rose by roughly 15 points between 1996 and 1998,

it appears that the different compositions of items on the ITBS and IGAP exams cannot explain

much of the observed ITBS-IGAP difference.
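The composition adjustment above can be checked with simple arithmetic. The sketch below uses the figures stated in the text (4.2 and 2.3 percentage points, 5 IGAP points per percentage point, a 1994 standard deviation of roughly 84 points, and a one-third basic-skills share on the ITBS); the IGAP's own share of basic-skills items is not reported here, so it is treated as a free parameter, with 12 percent as a purely hypothetical value that reproduces the two-point figure.

```python
# Back-of-envelope check of the composition adjustment in Section 8.2.
# Figures from the text: +4.2 pp on basic-skills items, +2.3 pp on
# critical-thinking items, 5 IGAP points per pp, 1994 SD of ~84 points.
# The IGAP's own basic-skills share is NOT reported; basic_share = 0.12
# below is a hypothetical value, not a figure from the paper.

GAIN_BASIC = 4.2
GAIN_CRITICAL = 2.3
POINTS_PER_PP = 5.0
SD_1994 = 84.0

def predicted_gain_points(basic_share):
    """IGAP point gain implied by a given share of basic-skills items."""
    pp = basic_share * GAIN_BASIC + (1.0 - basic_share) * GAIN_CRITICAL
    return pp * POINTS_PER_PP

# One pp in proportion correct is about 5 points, or ~0.06 SD:
print(round(POINTS_PER_PP / SD_1994, 2))          # 0.06

# Extra points from an ITBS-like mix (one-third basic skills) relative
# to a hypothetical IGAP mix with a 12% basic-skills share:
extra = predicted_gain_points(1 / 3) - predicted_gain_points(0.12)
print(round(extra, 1))                            # ~2 points, as in the text
```

Against the roughly 15-point IGAP rise between 1996 and 1998, the two additional points implied by this calculation are small, which is the basis for the conclusion in the text.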

8.3 The Role of Effort

If the consequences associated with ITBS performance led students to concentrate harder

during the exam or caused teachers to ensure optimal testing conditions for the exam, and these

factors were not present during the IGAP testing, student achievement on the ITBS may have

exceeded IGAP achievement even if the content of the exams were identical. Because we cannot

directly observe effort, however, it is difficult to determine its role in the recent ITBS improvements. While increased guessing cannot explain a significant portion of the ITBS gains, other forms of effort may play a larger role.

30 Because it is difficult to classify items in the reading exam, I limit the analysis to the math exam.

Insofar as there is a tendency for children to “give up” toward the end of the exam—

either leaving items blank or filling in answers randomly—an increase in effort may lead to a

disproportionate increase in performance on items at the end of the exam, when students are tired

and more inclined to give up. One might describe this type of effort as test stamina – i.e., the

ability to continue working and concentrating throughout the entire exam. Table 8.4 presents

OLS estimates of the relationship between item position and achievement gains from 1994 to

1998, using the same sample as Table 8.3. The analysis is limited to the reading exam because

the math exam is divided into several sections, so that item position is highly correlated with

item type. Conditional on item difficulty, student performance on the last 20 percent of items

increased 2.7 percentage points more than performance on the first 20 percent of items. In fact,

conditional on the item difficulty, it appears that all of the improvement on the ITBS reading

exam came at the end of the test. This suggests that effort played a significant role in the ITBS

gains seen under high-stakes testing.
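The item-position analysis behind Table 8.4 can be sketched as a simple item-level regression. The following is a schematic on synthetic data, not the paper's data: each observation is one reading item, the outcome is its 1994-98 gain in proportion correct, and the regressor of interest is a dummy for the last fifth of the exam, conditional on item difficulty. The 2.7-percentage-point end-of-test pattern is built into the simulated data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 50
position = np.arange(n_items) / (n_items - 1)    # 0 = first item, 1 = last
last_fifth = (position > 0.8).astype(float)
difficulty = rng.uniform(0.3, 0.9, n_items)      # baseline proportion correct

# Simulated 1994-98 gains with the "stamina" pattern: ~2.7 pp extra
# on the final items, small noise elsewhere.
gain = 0.001 + 0.027 * last_fifth + rng.normal(0, 0.002, n_items)

# OLS of the item-level gain on the last-fifth dummy and item difficulty
X = np.column_stack([np.ones(n_items), last_fifth, difficulty])
beta, *_ = np.linalg.lstsq(X, gain, rcond=None)
print(round(beta[1], 3))   # recovers ~0.027: the extra end-of-test gain
```

The dummy coefficient is the stamina effect described in the text; conditioning on difficulty guards against end-of-test items simply being harder or easier than earlier ones.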

9. Unintended Consequences of High-Stakes Testing

Having addressed the impact of the accountability policy on math and reading

achievement, we now examine how high-stakes testing has affected other student outcomes.

While there is little evidence that the accountability policy affected the probability of

transferring to private schools, moving out of the district, or dropping out of school,32 it appears

to have influenced primary grade retentions and achievement scores in science and social

studies.

31 All figures are for the average of third, sixth and eighth grades.


9.1 The Impact of High-Stakes Testing on the Early Primary Grades

Table 9.1 presents OLS estimates that control for a variety of student and school

demographics as well as pre-existing trends.33, 34 While enrollment rates decreased slightly over

this period, there does not appear to be any dramatic change following the introduction of high-

stakes testing. In contrast, retention rates appear to have increased sharply under high-stakes

testing. The retention rates were quite low among the 1993 cohort, ranging from 1.1 percent for

kindergartners to 4.2 percent for first graders. First graders in 1997-1999 were roughly 50

percent more likely to be retained than comparable peers in 1993 to 1996. Second graders were

nearly twice as likely to be retained after 1997.
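The retention estimates in Table 9.1 come from a linear probability model (the footnote notes that a probit yields nearly identical results). A minimal sketch of that specification on synthetic data: the 4.2 percent baseline is the first-grade figure from the text, while the 2.1-percentage-point policy effect is an illustrative magnitude (roughly the "50 percent more likely" pattern), not a coefficient taken from the table, and the single control is a stand-in for the paper's student and school covariates.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
post = rng.integers(0, 2, n).astype(float)     # 1 = 1997-99 cohorts, 0 = 1993-96
low_ach = rng.integers(0, 2, n).astype(float)  # stand-in for student controls

# Simulated first-grade retention: 4.2% baseline, +2.1 pp after the policy
p = 0.042 + 0.021 * post + 0.010 * low_ach
retained = (rng.random(n) < p).astype(float)

# Linear probability model: retention on a post-policy dummy plus controls
X = np.column_stack([np.ones(n), post, low_ach])
beta, *_ = np.linalg.lstsq(X, retained, rcond=None)
print(round(beta[1], 3))   # post-policy effect, close to the simulated 0.021
```

The LPM coefficient reads directly as a change in the retention probability, which is why the footnote favors it over the probit for ease of interpretation.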

9.2 The Impact of High-Stakes Testing on Science and Social Studies Achievement

Figure 9.1 shows fourth grade ITBS achievement trends in all subjects from 1995 to

1997.35 We see that science and social studies scores increased substantially from 1995 to

1996, most likely because 1995 was the first year that these exams were mandatory for fourth

graders. Over this same period, there was little change in math or reading scores. Following the

introduction of high-stakes testing in 1996-97, math and reading scores rose by roughly 0.20

grade equivalents, but there was no change in science or social studies achievement. Figure 9.2

shows similar trends for eighth grade students from 1995 to 1998. The patterns for eighth grade

32 Figures and tables available from the author upon request.

33 Probit estimates yield nearly identical results, so the estimates from a linear probability model are presented for ease of interpretation.

34 Roderick et al. (2000) found that retention rates in kindergarten, first and second grades started to rise in 1996 and jumped sharply in 1997 among first and second graders. This analysis builds on the previous work by examining enrollment and special education trends (not shown here) as well as retention, controlling for observable student characteristics and pre-existing trends and examining the heterogeneity across students.

35 Fourth graders did not take the ITBS science and social studies exams before 1995. After 1997, the composition of fourth grade cohorts changed due to the introduction of grade retention for third grade students. In general, a difficulty in studying trends in science and social studies achievement in Chicago is that the exams in these subjects are not given every year and, in many cases, are administered to different grades than the math and reading exams. The CPS has consistently required students in grades two to eight to take the reading and math portions of the ITBS, but has frequently changed which grades are required to take the science and social studies exams. The only consistent series consists of fourth and eighth grade students from 1995 to 1999.


are less clear, partly because of the noise introduced by the changes in test forms. If we focus on

years in which the same form was administered, 1996 and 1998, we see that the average social

studies achievement stayed relatively constant under high-stakes testing in comparison to scores

in the other subjects, which rose by about 0.30 GEs.

IGAP trends reveal a similar pattern. Figure 9.3 shows the social studies trends for

Chicago and other urban districts in Illinois. We see that Chicago was improving relative to

other urban districts in social studies from 1993 to 1996 in both the fourth and seventh grades. In

1997, this improvement stopped, and even reversed somewhat, leaving Chicago students further

behind their urban counterparts in social studies in 1997. On the other hand, Figure 9.4 suggests

that Chicago students continued to gain on their urban counterparts in science achievement

through 1997.

Table 9.2 presents OLS regression estimates that control for student and school

characteristics. We see that students in both grades had higher math and reading scores after the

introduction of high-stakes testing, but that science achievement stayed relatively flat and social

studies achievement declined slightly. Table 9.3 shows the OLS estimates of the relationship

between high-stakes testing and IGAP science and social studies scores. Conditional on prior

achievement trends, Chicago fourth graders in 1997 performed significantly worse than the

comparison group in science as well as social studies. The effect of 16.4 for social studies is

particularly large, nearly one-third of a standard deviation. The point estimate for seventh grade social studies is negative, though not statistically significant, and the effect for seventh grade science is zero.

Overall, it appears that social studies and science achievement have decreased relative to

reading and math achievement. The decrease appears larger for social studies than science and

somewhat larger for fourth grade in comparison to seventh or eighth grade. The decline in social


studies achievement is consistent with teachers substituting time away from this subject to focus

on reading. One reason that we might not have seen a similar drop in science scores is that

science is often taught by a separate teacher, whereas social studies is more often taught by the

regular teacher who is also responsible for reading and possibly mathematics instruction as well.

10. Conclusions

This analysis has generated a number of significant findings: (1) ITBS scores in

mathematics and reading increased substantially following the introduction of high-stakes testing

in Chicago; (2) the pattern of ITBS gains suggests that low-achieving schools made somewhat

larger gains under the policy, but that there was no consistent pattern regarding the differential

gains of low- versus high-achieving students; (3) in contrast to ITBS scores, IGAP math trends

in third and sixth grade showed little if any change following the introduction of high-stakes

testing; in eighth grade, IGAP math trends increased less sharply than comparable ITBS trends;

(4) a cross-district comparison of IGAP scores in Illinois during the 1990s suggests that high-

stakes testing has not affected math or reading achievement in any grade in Chicago; (5) the

introduction of high-stakes testing for math and reading coincided with a relative decline in

science and social studies achievement, as measured by both the ITBS and IGAP; (6) while the

probability that a student will transfer to a private school, leave the district, or drop out of

elementary school did not increase following the introduction of the accountability policy,

retention rates in first and second grade have increased substantially under high-stakes testing,

despite the fact that these grades are not directly influenced by the social promotion policy.

This analysis highlights both the strengths and weaknesses of high-stakes accountability

policies in education. On one hand, the results presented here suggest that students and teachers

have responded to the high-stakes testing program in Chicago. The likelihood of missing the


ITBS declined sharply following the introduction of the accountability policy, particularly

among low-achieving students. ITBS scores increased dramatically beginning in 1997. Survey

and interview data further indicate that many elementary students and teachers are working

harder (Engel and Roderick 2001, Tepper et al. 2001). All of these changes are most apparent

in the lowest performing schools.

On the other hand, there is evidence that teachers and students have responded to the

policy by concentrating on the most easily observable, high-stakes topics and neglecting areas

with few consequences or topics that are less easily observed. The differential achievement

trends on ITBS and IGAP suggest that the large ITBS gains may have been driven by test-day

motivation, test preparation or cheating and therefore do not reflect greater student

understanding of mathematics or greater facility in reading comprehension. Interviews and

survey data confirm that the majority of teachers have responded to high-stakes testing with low-

effort strategies such as increasing test preparation, aligning curriculum more closely with the

ITBS and referring low-achieving students to after-school programs, but have not substantially

changed their classroom instruction (Tepper et al. 2001). It appears that teachers and students

have shifted resources away from non-tested subjects, particularly social studies, which may be a

source of concern if these areas are important for future student development and growth.

Finally, the accountability policy appears to have increased retention rates in first and second grade, even though these grades were not directly subject to the promotional criteria.

These results suggest two different avenues for policy intervention. One alternative is to

expand the scope of the accountability measures – e.g., basing student promotion and school

probation decisions on IGAP as well as ITBS scores, including science and social studies in the

accountability system along with math and reading, etc. While this solution maintains the core

of the accountability concept, it may be expensive to develop and administer a variety of


additional accountability measures. Increasing the number of measures used for the

accountability policy may dilute the incentive associated with any one of the measures.

Another alternative is to mandate particular curriculum programs or instructional

practices. For example, in Fall 2001 the CPS will require 200 of the lowest performing

elementary schools to adopt one of several comprehensive reading programs offered by the CPS.

Unfortunately, there is little evidence that such mandated programmatic reforms have significant

effects on student learning (Jacob and Lefgren 2001b). More generally, this approach has many

of the same drawbacks that initially led to the adoption of a test-based accountability system,

including the difficulty of monitoring the activities of classroom teachers, the heterogeneity of

needs and problems and the inherent uncertainty of how to help students learn.

Regardless of the path chosen by Chicago, this analysis illustrates the difficulties

associated with accountability-oriented reform strategies in education, including the difficulty of

monitoring teacher and student behavior, the shortcomings of standardized achievement

measures and the importance of student and school capacity in responding to policy initiatives.

This study suggests that educators and policymakers must carefully examine the incentives

generated by high-stakes accountability policies in order to ensure that the policies foster long-term student learning rather than simply raising test scores in the short run.


References

Becker, W. E. and S. Rosen (1992). “The Learning Effect of Assessment and Evaluation in High

School.” Economics of Education Review 11(2): 107-118.

Bishop, J. (1998). Do Curriculum-Based External Exit Exam Systems Enhance Student

Achievement? Philadelphia, Consortium for Policy Research in Education, University of

Pennsylvania, Graduate School of Education: 1-32.

Bishop, J. H. (1990). “Incentives for Learning: Why American High School Students Compare

So Poorly to their Counterparts Overseas.” Research in Labor Economics 11: 17-51.

Bishop, J. H. (1996). Signaling, Incentives and School Organization in France, the Netherlands,

Britain and the United States. Improving America's Schools: The Role of Incentives. E.

A. Hanushek and D. W. Jorgenson. Washington, D.C., National Research Council: 111-

145.

Bryk, A. S. and S. W. Raudenbush (1992). Hierarchical Linear Models. Newbury Park, Sage

Publications.

Bryk, A. S., P. B. Sebring, et al. (1998). Charting Chicago School Reform. Boulder, CO,

Westview Press.

Cannell, J. J. (1987). Nationally Normed Elementary Achievement Testing in America's Public

Schools: How All Fifty States Are Above the National Average. Daniels, W. V., Friends

for Education.

Carnoy, M., S. Loeb, et al. (2000). Do Higher State Test Scores in Texas Make for Better High

School Outcomes? American Educational Research Association, New Orleans, LA.


Catterall, J. (1987). Toward Researching the Connections between Tests Required for High

School Graduation and the Inclination to Drop Out. Los Angeles, University of

California, Center for Research on Evaluation, Standards and Student Testing: 1-26.

Catterall, J. (1989). “Standards and School Dropouts: A National Study of Tests Required for

High School Graduation.” American Journal of Education 98(November): 1-34.

Easton, J. Q., T. Rosenkranz, et al. (2001). Annual CPS Test Trend Review, 2000. Chicago, IL,

Consortium on Chicago School Research.

Easton, J. Q., T. Rosenkranz, et al. (2000). Annual CPS Test Trend Review, 1999. Chicago,

Consortium on Chicago School Research.

ECS (2000). ECS State Notes, Education Commission of the States (www.ecs.org).

Feldman, S. (1996). Do Legislators Maximize Votes? Working paper. Cambridge, MA.

Frederiksen, N. (1994). The Influence of Minimum Competency Tests on Teaching and

Learning. Princeton, Educational Testing Service, Policy Information Center.

Gardner, J. A. (1982). Influence of High School Curriculum on Determinants of Labor Market

Experience. Columbus, The National Center for Research in Vocational Education, Ohio

State University.

Goodlad, J. (1983). A Place Called School. New York, McGraw-Hill.

Griffin, B. W. and M. H. Heidorn (1996). “An Examination of the Relationship between

Minimum Competency Test Performance and Dropping Out of High School.”

Educational Evaluation and Policy Analysis 18(3): 243-252.

Grissmer, D. and A. Flanagan (1998). Exploring Rapid Achievement Gains in North Carolina

and Texas. Washington, D.C., National Education Goals Panel.

Haney, W. (2000). “The Myth of the Texas Miracle in Education.” Education Policy Analysis

Archives 8(41).


Heubert, J. P. and R. M. Hauser, Eds. (1999). High Stakes: Testing for Tracking, Promotion and

Graduation. Washington, D.C., National Academy Press.

Jacob, B. A. (2001). “Getting Tough? The Impact of Mandatory High School Graduation

Exams on Student Outcomes.” Educational Evaluation and Policy Analysis

Forthcoming.

Jacob, B. A. and L. Lefgren (2001a). “Making the Grade: The Impact of Summer School and

Grade Retention on Student Outcomes.” Working Paper.

Jacob, B. A. and L. Lefgren (2001b). “The Impact of a School Probation Policy on Student

Achievement in Chicago.” Working Paper.

Jacob, B. A. and S. D. Levitt (2001). “The Impact of High-Stakes Testing on Cheating: Evidence

from Chicago.” Working Paper.

Jacob, B. A., M. Roderick, et al. (2000). The Impact of Chicago's Policy to End Social

Promotion on Student Achievement. American Educational Research Association, New

Orleans.

Kang, S. and J. Bishop (1984). The Impact of Curriculum on the Non-College Bound Youth's

Labor Market Outcomes. High School Preparation for Employment. L. Hotchkiss, J.

Bishop and S. Kang. Columbus, The National Center for Research in Vocational

Education, Ohio State University: 95-135.

Klein, S. P., L. S. Hamilton, et al. (2000). What Do Test Scores in Texas Tell Us? Santa Monica,

CA, RAND.

Koretz, D. (1999). Foggy Lenses: Limitations in the Use of Achievement Tests as Measures of

Educators' Productivity. Devising Incentives to Promote Human Capital, Irvine, CA.

Koretz, D., R. L. Linn, et al. (1991). The Effects of High-Stakes Testing: Preliminary Evidence

About Generalization Across Tests. American Educational Research Association,

Chicago.


Koretz, D. M. and S. I. Barron (1998). The Validity of Gains in Scores on the Kentucky

Instructional Results Information System (KIRIS). Santa Monica, RAND.

Kreitzer, A. E., G. F. Madaus, et al. (1989). Competency Testing and Dropouts. Dropouts from

School: Issues, Dilemmas and Solutions. L. Weis, E. Farrar and H. G. Petrie. Albany,

State University of New York Press: 129-152.

Ladd, H. F. (1999). “The Dallas School Accountability and Incentive Program: An Evaluation of

its Impacts on Student Outcomes.” Economics of Education Review 18: 1-16.

Lillard, D. R. and P. P. DeCicca (Forthcoming). “Higher Standards, More Dropouts? Evidence

Within and Across Time.” Economics of Education Review.

Linn, R. L. (2000). “Assessments and Accountability.” Educational Researcher 29(2): 4-16.

Linn, R. L. and S. B. Dunbar (1990). “The Nation's Report Card Goes Home: Good News and

Bad News About Trends in Achievement.” Phi Delta Kappan 72(2): 127-133.

Linn, R. L., M. E. Graue, et al. (1990). “Comparing State and District Results to National

Norms: The Validity of the Claim that 'Everyone is Above Average'.” Educational

Measurement: Issues and Practice 9(3): 5-14.

Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York, John

Wiley & Sons.

MacIver, D. J., D. A. Reuman, et al. (1995). “Social Structuring of the School: Studying What Is,

Illuminating What Could Be.” Annual Review of Psychology 46: 375-400.

MacMillan, D. L., I. H. Balow, et al. (1990). A Study of Minimum Competency Tests and Their

Impact: Final Report. Washington, D.C., Office of Special Education and Rehabilitation

Services, U.S. Department of Education.

Madaus, G. and V. Greaney (1985). “The Irish Experience in Competency Testing: Implications

for American Education.” American Journal of Education 93: 268-294.


McNeil, L. and A. Valenzuela (1998). The Harmful Impact of the TAAS System of Testing in

Texas: Beneath the Accountability Rhetoric. High Stakes K-12 Testing Conference,

Teachers College, Columbia University.

Meyer, J. W. and B. Rowan (1978). The Structure of Educational Organizations. Environments

and Organizations. M. Meyer. San Francisco, CA, Jossey-Bass: 78-109.

Meyer, R. (1982). Job Training in the Schools. Job Training for Youth. R. Taylor, H. Rosen and

F. Pratzner. Columbus, National Center for Research in Vocational Education, Ohio State

University.

Neill, M. and K. Gayler (1998). Do High Stakes Graduation Tests Improve Learning Outcomes?

Using State-Level NAEP Data to Evaluate the Effects of Mandatory Graduation Tests.

High Stakes K-12 Testing Conference, Teachers College, Columbia University.

Pearson, D. P. and T. Shanahan (1998). “The Reading Crisis in Illinois: A Ten Year

Retrospective of IGAP.” Illinois Reading Council Journal 26(3): 60-67.

Powell, A. G. (1996). Motivating Students to Learn: An American Dilemma. Rewards and

Reform: Creating Educational Incentives that Work. S. H. Fuhrman and J. A. O'Day. San

Francisco, Jossey-Bass Inc.: 19-59.

Powell, A. G., E. Farrarr, et al. (1985). The Shopping Mall High School: Winners and Losers in

the Educational Marketplace. Boston, Houghton Mifflin.

Reardon, S. (1996). Eighth Grade Minimum Competency Testing and Early High School

Dropout Patterns. American Educational Research Association Annual Meeting, New

York.

Roderick, M. and M. Engel (2000). The Grasshopper and the Ant: Motivational Responses of

Low-Achieving Students to High-Stakes Testing. American Educational Research

Association, New Orleans.


Roderick, M., J. Nagaoka, et al. (2000). Update: Ending Social Promotion. Chicago, IL,

Consortium on Chicago School Research.

Roderick, M., J. Nagaoka, et al. (2001). Helping Students Meet the Demands of High-Stakes

Testing: Is There a Role for Summer School? Annual Meeting of the American

Educational Research Association, Seattle, WA.

Sedlak, M. W., C. W. Wheeler, et al. (1986). Selling Students Short: Classroom Bargains and

Academic Reform in American High Schools. New York, Teachers College Press.

Shepard, L. A. (1988). Should Instruction be Measurement-Driven? American Educational

Research Association, New Orleans.

Shepard, L. A. (1990). “Inflated Test Score Gains: Is the Problem Old Norms or Teaching the

Test?” Educational Measurement: Issues and Practice 9(3): 15-22.

Sizer, T. R. (1984). Horace's Compromise: The Dilemma of the American High School. Boston,

Houghton Mifflin.

Smith, B. A. and S. Degener (2001). Staying After to Learn: A First Look at Lighthouse. Annual

Meeting of the American Educational Research Association, Seattle, WA.

Stecher, B. M. and S. I. Barron (1999). Quadrennial Milepost Accountability Testing in

Kentucky. Los Angeles, Center for the Study of Evaluation, University of California.

Stevenson, H. W. and J. W. Stigler (1992). The Learning Gap: Why Our Schools are Failing and

What We Can Learn from Japanese and Chinese Education. New York, Simon and

Schuster.

Tepper, R., M. Roderick, et al. (Forthcoming). The Impact of High-Stakes Testing on Classroom

Instructional Practices in Chicago. Chicago, IL, Consortium on Chicago School

Research.

Tepper, R. L. (2001). The Influence of High-Stakes Testing on Instructional Practice in Chicago.

American Educational Research Association, Seattle, WA.


Tomlinson, T. M. and C. T. Cross (1991). “Student Effort: The Key to Higher Standards.”

Educational Leadership(September): 69-73.

Tyack, D. B. and L. Cuban (1995). Tinkering Toward Utopia: A Century of Public School

Reform. Cambridge, MA, Harvard University Press.

Waller, W. (1932). Sociology of Teaching. New York, Wiley.

Weick, K. E. (1976). “Educational Organizations as Loosely Coupled Systems.” Administrative

Science Quarterly 21: 1-19.

Winfield, L. F. (1990). “School Competency Testing Reforms and Student Achievement:

Exploring a National Perspective.” Educational Evaluation and Policy Analysis 12(2):

157-173.

Wright, B. and M. H. Stone (1979). Best Test Design. Chicago, IL, MESA Press.


Table 2.1: Descriptive Statistics on the Social Promotion Policy in Chicago

                                                         3rd Grade   6th Grade   8th Grade(b)
1995-96
  Total Fall Enrollment                                      --          --         29,664
  Promotional Cutoff for ITBS (Grade Equivalents)            --          --            6.8
  Percent Subject to Policy (% tested and included)          --          --            .82
  Percent Failed to Meet Promotional Criteria(a)             --          --            .33
    Percent Failed Reading                                   --          --            .07
    Percent Failed Math                                      --          --            .13
    Percent Failed Reading and Math                          --          --            .13
  Percent Retained or in Transition Center Next Year(a)      --          --            .07
1996-97
  Total Fall Enrollment                                  34,775      31,385         29,437
  Promotional Cutoff for ITBS (Grade Equivalents)           2.8         5.3            7.0
  Percent Subject to Policy (% tested and included)         .71         .81            .80
  Percent Failed to Meet Promotional Criteria(a)            .49         .35            .30
    Percent Failed Reading                                  .06         .05            .06
    Percent Failed Math                                     .17         .17            .13
    Percent Failed Reading and Math                         .26         .13            .11
  Percent Retained or in Transition Center Next Year(a)     .21         .13            .13
1997-98
  Total Fall Enrollment                                  39,400      33,334         31,045
  Promotional Cutoff for ITBS (Grade Equivalents)           2.8         5.3            7.2
  Percent Subject to Policy (% tested and included)         .71         .81            .80
  Percent Failed to Meet Promotional Criteria(a)            .43         .31            .34
    Percent Failed Reading                                  .02         .07            .08
    Percent Failed Math                                     .20         .13            .11
    Percent Failed Reading and Math                         .19         .11            .14
  Percent Retained or in Transition Center Next Year(a)     .23         .13            .14
1998-99
  Total Fall Enrollment                                  40,957      33,173         29,838
  Promotional Cutoff for ITBS (Grade Equivalents)           2.8         5.3            7.4
  Percent Subject to Policy (% tested and included)         .69         .80            .79
  Percent Failed to Meet Promotional Criteria(a)            .40         .29            .31
    Percent Failed Reading                                  .05         .06            .07
    Percent Failed Math                                     .19         .14            .12
    Percent Failed Reading and Math                         .16         .09            .12
  Percent Retained or in Transition Center Next Year(a)     .19         .12            .10
1999-2000
  Total Fall Enrollment                                  40,691      31,149         31,387
  Promotional Cutoff for ITBS (Grade Equivalents)           2.8         5.5            7.7
  Percent Subject to Policy (% tested and included)         .69         .79            .77
  Percent Failed to Meet Promotional Criteria(a)            .41         .31            .40
    Percent Failed Reading                                  .03         .04            .09
    Percent Failed Math                                     .22         .17            .14
    Percent Failed Reading and Math                         .16         .10            .17
  Percent Retained or in Transition Center Next Year(a)     .12         .09            .12

Notes: Students who were not retained or placed in a transition center (1) were promoted, (2) were placed in a non-graded special education classroom, or (3) left the CPS.
(a) Conditional on being subject to the accountability policy.
(b) The eighth grade sample includes students in transition centers whose transcripts indicated they were in either eighth or ninth grade.


Table 4.1: Summary Statistics
(Standard deviations in parentheses)

Variables                               3rd Grade       6th Grade       8th Grade

Student Outcomes
  Missed ITBS                             0.178           0.076           0.087
  Excluded from ITBS                      0.122           0.116           0.111
  ITBS Math Score (GE)                    3.471 (.962)    6.363 (1.277)   7.972 (1.287)
  ITBS Reading Score (GE)                 3.082 (1.122)   6.025 (1.484)   7.881 (1.758)
  ITBS Math Score (Rasch)                -1.247 (1.116)   0.691 (.918)    1.539 (.777)
  ITBS Reading Score (Rasch)             -1.271 (1.091)   0.087 (.970)    1.123 (.909)
  Transferred to private school           0.012           0.012           0.058
  Moved out of CPS                        0.053           0.049           0.069
  Dropped out                             0.010           0.010           0.054

Prior Achievement Indicators
  School < 15% at norms                   0.274           0.226           0.207
  School 15-25% at norms                  0.374           0.387           0.399
  School > 25% at norms                   0.353           0.387           0.393
  Student < 10%                           0.110           0.161           0.122
  Student 10-25%                          0.263           0.313           0.298
  Student 25-35%                          0.151           0.154           0.172
  Student 35-50%                          0.178           0.162           0.186
  Student > 50%                           0.298           0.210           0.222

Student Demographics
  Male                                    0.488           0.479           0.473
  Black                                   0.654           0.552           0.554
  Hispanic                                0.205           0.304           0.299
  Black male                              0.316           0.259           0.255
  Hispanic male                           0.103           0.150           0.146
  Age                                     9.219 (.590)   12.210 (.483)   14.166 (.524)
  Living in foster care                   0.048           0.035           0.028
  Living with at least one parent         0.855           0.876           0.901
  Neighborhood poverty                    0.315 (.742)    0.236 (.236)    0.233 (.704)
  Neighborhood social status             -0.261 (.687)   -0.275 (.689)   -0.275 (.684)
  Special education                       0.036           0.024           0.018
  Free lunch                              0.661           0.642           0.586
  Reduced price lunch                     0.065           0.072           0.068
  Currently in bilingual program          0.048           0.095           0.076
  Past bilingual participation            0.116           0.192           0.210

School Demographics
  % Limited English Proficient           12.245          15.858          15.502
  % Low income                           82.459          81.974          81.263
  % Daily attendance                     92.563          92.733          92.555
  Mobility rate                          29.865          28.418          28.522
  School enrollment                     767.588         780.305         772.123
  > 85% African-American                  0.545           0.450           0.445
  > 85% Hispanic                          0.081           0.126           0.119
  > 85% African-American + Hispanic       0.103           0.124           0.134
  70-85% African-American + Hispanic      0.082           0.098           0.094
  < 70% African-American + Hispanic       0.189           0.203           0.208

Census Tract Characteristics
  Census tract population              4848.064        4924.203        4912.159
  Crime composite                         0.176           0.086           0.084
  % School age children                   0.235           0.232           0.232
  % Black                                 0.589           0.505           0.508
  % Hispanic                              0.169           0.223           0.223
  Median HH income                    22448.200       23093.400       23183.110
  % Own home                              0.396           0.402           0.405
  % Lived in same house for 5 years       0.577           0.567           0.569
  Mean education level                   12.099          12.084          12.083
  % Managers/professionals                0.173           0.170           0.170
  Poverty rate                            0.282           0.262           0.261
  Unemployment rate                       0.419           0.405           0.404
  Female headed HH                        0.442           0.402           0.401

Number of observations                165,143         169,377         139,683

Notes: The sample includes all tested and included students in cohorts 1993 to 1999 (1993 to 1998 for eighth grade) who were not missing demographic information. Sample sizes for the first two statistics (missed ITBS and excluded from ITBS) are larger than shown in this table because they are taken from the entire sample.


Table 5.1: OLS Estimates of the Probability of Not Being Tested or Being Excluded from Reporting

Dependent Variable:           Not Tested                  Excluded (Conditional on Tested)
                              Full          Black         Full          Black

3rd Grade
  1997 Cohort                -.016 (.006)  -.008 (.005)   .058 (.006)   .003 (.002)
  1998 Cohort                -.029 (.008)  -.016 (.006)   .064 (.007)  -.000 (.002)
  1999 Cohort                -.144 (.014)  -.008 (.007)   .148 (.010)  -.001 (.003)
  Baseline Rates              .055                        .054
  Number of Observations      228,734       122,251       190,762       117,238
  R-Squared                   .37           .09           .56           .68

6th Grade
  1997 Cohort                -.011 (.003)  -.011 (.003)   .018 (.002)   .005 (.002)
  1998 Cohort                -.018 (.004)  -.009 (.004)   .010 (.002)   .003 (.002)
  1999 Cohort                -.019 (.005)  -.008 (.005)   .020 (.003)  -.001 (.003)
  Baseline Rates              .045                        .086
  Number of Observations      207,578       110,350       193,671       106,947
  R-Squared                   .28           .08           .79           .84

8th Grade
  1996 Cohort                -.019 (.004)  -.028 (.005)   .005 (.002)   .006 (.003)
  1997 Cohort                -.032 (.006)  -.043 (.008)   .023 (.003)   .010 (.003)
  1998 Cohort                -.040 (.008)  -.057 (.010)   .018 (.003)   .009 (.004)
  Baseline Rates              .057                        .080
  Number of Observations      172,205       92,781        159,559       89,042
  R-Squared                   .28           .09           .84           .86

Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include a year trend, race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.
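The notes describe the common estimating equation used throughout these tables: OLS of an outcome on post-policy cohort indicators plus student controls, with standard errors that allow errors to be correlated within schools. As a rough sketch of that computation (the data, variable names, and magnitudes below are entirely hypothetical, not the paper's), the cluster-robust (Liang-Zeger) covariance can be built directly with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 students nested in 20 schools (all values illustrative).
n, n_schools = 200, 20
school = rng.integers(0, n_schools, size=n)
post_policy = rng.integers(0, 2, size=n).astype(float)  # post-policy cohort dummy
prior_score = rng.normal(size=n)                        # prior achievement control
X = np.column_stack([np.ones(n), post_policy, prior_score])
y = (rng.random(n) < 0.05 + 0.02 * post_policy).astype(float)  # 1 = not tested

# OLS coefficients: beta = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ (X.T @ y)
resid = y - X @ beta

# Cluster-robust covariance: sum score vectors within each school, so that
# errors may be arbitrarily correlated among students in the same school.
k = X.shape[1]
meat = np.zeros((k, k))
for g in range(n_schools):
    score = X[school == g].T @ resid[school == g]
    meat += np.outer(score, score)
V = XtX_inv @ meat @ XtX_inv
se = np.sqrt(np.diag(V))  # clustered SEs for [intercept, post_policy, prior_score]
```

With only one regressor of interest, the clustered SE for `post_policy` is what the parenthesized numbers in the tables report; clustering typically inflates the SE relative to the classical formula when outcomes are correlated within schools.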


Table 5.2: OLS Estimates of the Probability of Not Being Tested or Being Excluded from Reporting, by Prior Achievement Level (Black Students Only)

                              Missed Test                               Excluded
                              3rd          6th          8th             3rd          6th          8th

Average Effect               -.011 (.005) -.011 (.003) -.022 (.005)     .002 (.001)  .005 (.002)  .006 (.002)

Average Effect by School Prior Achievement
  < 15% students scored above the 50th percentile
                             -.015 (.005) -.015 (.004) -.032 (.006)     .004 (.003)  .009 (.002)  .006 (.003)
  15-25% students scored above the 50th percentile
                             -.006 (.005) -.010 (.004) -.021 (.005)     .001 (.002)  .004 (.002)  .006 (.003)
  > 25% students scored above the 50th percentile
                             -.013 (.007) -.004 (.005) -.007 (.005)     .001 (.002)  .001 (.003)  .004 (.004)

Average Effect by Student Prior Achievement Level
  < 10th percentile          -.005 (.007) -.015 (.004) -.043 (.007)     .016 (.004)  .021 (.004)  .025 (.004)
  10-25th percentile         -.004 (.004) -.013 (.003) -.019 (.005)     .003 (.003)  .006 (.002)  .010 (.003)
  25-35th percentile         -.012 (.005) -.008 (.004) -.019 (.005)     .002 (.003) -.004 (.002) -.002 (.003)
  35-50th percentile         -.007 (.004) -.009 (.004) -.014 (.005)    -.005 (.002) -.004 (.002) -.006 (.003)
  > 50th percentile          -.004 (.004) -.005 (.003) -.013 (.005)    -.004 (.002) -.007 (.002) -.006 (.002)

Notes: Each cell contains an estimate from a separate regression. The estimate reflects the coefficient on a variable indicating whether the student was part of a post-policy cohort. School prior achievement is based on 1995 reading scores. Student prior achievement is based on the average of all non-missing test scores in the years t-2 and t-3 for sixth and eighth graders and t-1 and t-2 for third graders. The sample includes all black first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include a year trend, race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.


Table 5.3: OLS Estimates of the Probability of Not Being Tested or Being Excluded from Reporting (Black Students)

                              Missed Test                               Excluded
                              4th          5th          7th             4th          5th          7th

Average Effect               -.007 (.003) -.010 (.003) -.013 (.004)     .001 (.002)  .008 (.002)  .003 (.002)

Average Effect by School Prior Achievement
  < 15% students scored above the 50th percentile
                             -.009 (.005) -.013 (.005) -.022 (.005)     .001 (.003)  .008 (.003)  .002 (.003)
  15-25% students scored above the 50th percentile
                             -.003 (.004) -.007 (.004) -.013 (.005)    -.003 (.003)  .009 (.002)  .003 (.003)
  > 25% students scored above the 50th percentile
                             -.009 (.005) -.009 (.004) -.003 (.006)    -.004 (.004)  .005 (.003)  .004 (.004)

Average Effect by Student Prior Achievement Level
  < 10th percentile          -.008 (.006) -.005 (.006) -.020 (.007)     .012 (.005)  .025 (.004)  .022 (.005)
  10-25th percentile         -.000 (.005) -.009 (.004) -.010 (.005)     .002 (.004)  .010 (.003)  .005 (.003)
  25-35th percentile         -.005 (.005) -.008 (.004) -.015 (.005)    -.002 (.004)  .003 (.003) -.002 (.003)
  35-50th percentile         -.007 (.004) -.011 (.004) -.010 (.005)    -.003 (.003) -.004 (.003) -.010 (.003)
  > 50th percentile          -.005 (.004) -.003 (.004) -.013 (.004)    -.007 (.003) -.004 (.002) -.009 (.003)

Notes: Each cell contains an estimate from a separate regression. The estimate reflects the coefficient on a variable indicating whether the student was part of a post-policy cohort. School prior achievement is based on 1995 achievement scores. Student prior achievement is based on the average of all non-missing test scores in the years t-2 and t-3 for sixth and eighth graders and t-1 and t-2 for third graders. The sample includes all black first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include a year trend, race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.


Table 6.1: OLS Estimates of ITBS Math and Reading Achievement

Dependent Variable:     ITBS reading score, controlling for            ITBS math score, controlling for
                        achievement in the following periods:          achievement in the following periods:
                        t-1,t-2,t-3   t-2,t-3       t-3                t-1,t-2,t-3   t-2,t-3       t-3

3rd Grade
  1997 Cohort           .013 (.011)   .055 (.012)   .077 (.011)        .115 (.013)   .176 (.014)   .194 (.014)
  1998 Cohort           .102 (.012)   .213 (.013)   .242 (.014)        .142 (.014)   .250 (.016)   .291 (.016)
  1999 Cohort           .025 (.014)   .062 (.015)   .204 (.014)        .136 (.015)   .186 (.016)   .338 (.016)

6th Grade
  1997 Cohort           .045 (.007)   .061 (.007)   .053 (.008)        .069 (.007)   .093 (.008)   .097 (.009)
  1998 Cohort           .099 (.008)   .142 (.009)   .165 (.009)        .126 (.008)   .164 (.009)   .196 (.010)
  1999 Cohort           .055 (.007)   .093 (.008)   .105 (.009)        .067 (.008)   .123 (.009)   .167 (.011)

8th Grade
  1996 Cohort           .132 (.008)   .133 (.008)   .114 (.008)        .095 (.006)   .078 (.007)   .069 (.007)
  1997 Cohort           .158 (.009)   .182 (.009)   .189 (.010)        .115 (.007)   .132 (.008)   .139 (.009)
  1998 Cohort           .165 (.009)   .231 (.009)   .264 (.010)        .170 (.008)   .233 (.009)   .272 (.010)

Baseline Gains          1 Year Gain   2 Year Gain   3 Year Gain        1 Year Gain   2 Year Gain   3 Year Gain
  3rd Grade             .599 (.824)   1.444 (1.011)  --                .606 (.772)   1.354 (.913)   --
  6th Grade             .503 (.647)    .945 (.674)  1.432 (.773)       .659 (.479)   1.323 (.586)  2.027 (.726)
  8th Grade             .484 (.593)   1.013 (.619)  1.490 (.637)       .472 (.412)    .938 (.509)  1.594 (.578)

Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses. Baseline gains are based on the average gains for the 1993 to 1995 cohorts. The end point in the gain scores is always the grade in question. For example, one year gains for 6th graders refers to the 6th grade score - 5th grade score; two year gains are calculated as 6th grade score - 4th grade score and three year gains are calculated as 6th grade score - 3rd grade score.
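The gain scores in the bottom panel are simple differences of scale scores across grades, as the notes describe: for a 6th grader, the one-year gain is the 6th-grade score minus the 5th-grade score, and so on. A tiny sketch with hypothetical score histories (the students and values below are made up for illustration):

```python
# Hypothetical per-student score histories keyed by grade (values illustrative).
scores = {
    "student_a": {3: 3.1, 4: 4.0, 5: 4.8, 6: 5.6},
    "student_b": {3: 2.7, 4: 3.5, 5: 4.1, 6: 5.0},
}

def gain(history, grade, span):
    """Gain over `span` years ending at `grade` (span=1 for 6th grade: 6th - 5th)."""
    return history[grade] - history[grade - span]

# Average one-year and three-year gains ending in 6th grade.
one_year = sum(gain(h, 6, 1) for h in scores.values()) / len(scores)
three_year = sum(gain(h, 6, 3) for h in scores.values()) / len(scores)
```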


Table 6.2: OLS Estimates of ITBS Reading Achievement Using Various Specifications and Samples

Specifications:
(1)  Baseline
(2)  Included & Excluded
(3)  Black Included
(4)  Black Included & Excluded
(5)  Missing Data Imputed at 0%ile in School
(6)  Missing Data Imputed at 25%ile in School
(7)  Missing Data Imputed at 50%ile in School
(8)  With Aggregate Year Trend
(9)  School-Specific FE and Trends
(10) Form L Only
(11) Form L with Aggregate Year Trend

                 (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)    (11)
3rd Grade
  1997 Cohort    .013    .008    .032    .030    .050    .033    .022    .091    .090     --      --
                (.011)  (.011)  (.013)  (.013)  (.017)  (.012)  (.012)  (.015)  (.008)
  1998 Cohort    .213    .195    .154    .145    .222    .190    .193    .358    .349    .234    .199
                (.013)  (.013)  (.016)  (.015)  (.019)  (.014)  (.013)  (.020)  (.011)  (.014)  (.026)
  1999 Cohort    .204    .064    .217    .205    .336    .160    .082    .218    .214     --      --
                (.014)  (.015)  (.016)  (.015)  (.018)  (.013)  (.015)  (.027)  (.014)

6th Grade
  1997 Cohort    .045    .039    .051    .045    .055    .045    .038    .033    .037     --      --
                (.007)  (.007)  (.009)  (.009)  (.008)  (.007)  (.007)  (.010)  (.006)
  1998 Cohort    .142    .117    .166    .132    .151    .129    .118    .167    .170    .141    .072
                (.009)  (.009)  (.011)  (.011)  (.010)  (.009)  (.009)  (.013)  (.008)  (.008)  (.013)
  1999 Cohort    .105    .079    .116    .089    .144    .099    .080    .163    .168     --      --
                (.009)  (.009)  (.011)  (.011)  (.011)  (.009)  (.009)  (.016)  (.010)

8th Grade
  1996 Cohort    .132    .124    .150    .139    .127    .123    .121    .154    .150    .141     --
                (.008)  (.007)  (.011)  (.010)  (.010)  (.008)  (.008)  (.010)  (.006)  (.008)
  1997 Cohort    .182    .165    .199    .180    .187    .167    .160    .208    .204     --      --
                (.009)  (.009)  (.013)  (.013)  (.012)  (.009)  (.009)  (.015)  (.009)
  1998 Cohort    .264    .234    .290    .256    .273    .238    .226    .227    .227    .170     --
                (.010)  (.010)  (.012)  (.012)  (.013)  (.010)  (.010)  (.020)  (.012)  (.009)

Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.


Table 6.3: OLS Estimates of ITBS Math Achievement Using Various Specifications and Samples

Specifications:
(1)  Baseline
(2)  Included & Excluded
(3)  Black Included
(4)  Black Included & Excluded
(5)  Missing Data Imputed at 0%ile in School
(6)  Missing Data Imputed at 25%ile in School
(7)  Missing Data Imputed at 50%ile in School
(8)  With Aggregate Year Trend
(9)  School-Specific FE and Trends
(10) Form L Only
(11) Form L with Aggregate Year Trend

                 (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)    (11)
3rd Grade
  1997 Cohort    .115    .122    .127    .126    .162    .152    .137    .218    .215     --      --
                (.013)  (.014)  (.016)  (.016)  (.017)  (.014)  (.014)  (.017)  (.007)
  1998 Cohort    .250    .256    .238    .233    .303    .270    .255    .428    .418    .290    .175
                (.016)  (.015)  (.019)  (.018)  (.020)  (.016)  (.015)  (.023)  (.011)  (.016)  (.028)
  1999 Cohort    .338    .256    .347    .335    .526    .367    .280    .346    .341     --      --
                (.016)  (.016)  (.019)  (.019)  (.020)  (.015)  (.015)  (.029)  (.010)

6th Grade
  1997 Cohort    .069    .062    .067    .057    .080    .069    .065    .096    .098     --      --
                (.007)  (.007)  (.009)  (.009)  (.009)  (.008)  (.008)  (.010)  (.005)
  1998 Cohort    .164    .144    .170    .140    .180    .158    .147    .257    .255    .172    .152
                (.009)  (.009)  (.012)  (.011)  (.011)  (.009)  (.009)  (.015)  (.006)  (.009)  (.017)
  1999 Cohort    .167    .135    .178    .139    .201    .158    .141    .280    .281     --      --
                (.011)  (.010)  (.013)  (.012)  (.013)  (.011)  (.011)  (.019)  (.009)

8th Grade
  1996 Cohort    .095    .087    .099    .086    .093    .089    .086    .201    .198    .057     --
                (.006)  (.006)  (.008)  (.008)  (.010)  (.006)  (.006)  (.008)  (.004)  (.007)
  1997 Cohort    .132    .114    .148    .122    .144    .116    .113    .274    .268     --      --
                (.008)  (.008)  (.011)  (.011)  (.011)  (.008)  (.009)  (.013)  (.007)
  1998 Cohort    .272    .246    .284    .248    .290    .253    .243    .363    .361    .128     --
                (.010)  (.010)  (.012)  (.012)  (.013)  (.010)  (.010)  (.018)  (.009)  (.009)

Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.


Table 6.4: OLS Estimates of High-Stakes Testing Effects in Gate Grades

                              Reading                                   Math
                              3rd Grade    6th Grade    8th Grade       3rd Grade    6th Grade    8th Grade

Aggregate Effects
  All Students                .046 (.009)  .066 (.005)  .150 (.006)     .132 (.011)  .088 (.006)  .125 (.005)

Effects by School Prior Achievement
  < 15% students in the school scored above the 50th percentile
                              .075 (.017)  .108 (.012)  .177 (.013)     .180 (.026)  .116 (.012)  .134 (.009)
  15-25% students in the school scored above the 50th percentile
                              .059 (.013)  .064 (.006)  .147 (.011)     .117 (.016)  .072 (.010)  .140 (.009)
  > 25% students in the school scored above the 50th percentile
                              .015 (.015)  .044 (.009)  .139 (.009)     .113 (.017)  .087 (.010)  .106 (.008)

Effects by Student Prior Achievement
  < 10th percentile           .041 (.013)  .120 (.008)  .103 (.012)     .161 (.016)  .055 (.008)  .063 (.008)
  10-25th percentile          .071 (.011)  .060 (.007)  .146 (.007)     .153 (.014)  .067 (.007)  .108 (.004)
  25-35th percentile          .081 (.013)  .053 (.009)  .166 (.009)     .131 (.015)  .095 (.008)  .144 (.007)
  35-50th percentile          .038 (.013)  .044 (.008)  .147 (.009)     .111 (.015)  .103 (.009)  .146 (.007)
  > 50th percentile          -.084 (.015)  .030 (.010)  .125 (.009)     .052 (.015)  .129 (.008)  .136 (.008)

Notes: Each cell contains an estimate from a separate regression. The estimate reflects the coefficient on a variable indicating whether the student was part of a post-policy cohort. School prior achievement is based on 1995 achievement scores. Student prior achievement is based on the average of all non-missing test scores in the years t-2 and t-3 for sixth and eighth graders and t-1 and t-2 for third graders. The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.


Table 6.5: OLS Estimates of High-Stakes Testing Effects for Non-Gate Grades in 1997

                              Reading                                   Math
                              4th Grade    5th Grade    7th Grade       4th Grade    5th Grade    7th Grade

Aggregate Effects
  1997 Effect                -.015 (.008)  .089 (.007)  .105 (.006)     .071 (.004)  .040 (.008)  .093 (.007)
  1998 Effect                 .068 (.008)  .130 (.010)

By School Prior Achievement
  < 15% students in the school scored above the 50th percentile
                              .019 (.017)  .106 (.008)  .102 (.013)     .144 (.020)  .083 (.013)  .093 (.015)
  15-25% students in the school scored above the 50th percentile
                             -.032 (.012)  .072 (.007)  .098 (.010)     .083 (.014)  .059 (.010)  .112 (.009)
  > 25% students in the school scored above the 50th percentile
                             -.016 (.014)  .048 (.008)  .116 (.009)     .017 (.014)  .051 (.011)  .076 (.011)

By Student Prior Achievement
  < 10th percentile          -.054 (.015)  .088 (.009)  .092 (.012)     .158 (.015)  .072 (.009)  .049 (.010)
  10-25th percentile         -.014 (.012)  .109 (.008)  .052 (.008)     .135 (.014)  .065 (.008)  .065 (.007)
  25-35th percentile         -.001 (.015)  .103 (.010)  .076 (.010)     .082 (.013)  .069 (.010)  .096 (.009)
  35-50th percentile          .003 (.013)  .077 (.010)  .107 (.010)     .045 (.013)  .058 (.010)  .115 (.011)
  > 50th percentile           .012 (.014) -.010 (.010)  .173 (.011)    -.015 (.013)  .036 (.010)  .157 (.011)

Notes: Each cell contains an estimate from a separate regression. The estimate reflects the coefficient on a variable indicating whether the student was part of a post-policy cohort. School prior achievement is based on 1995 achievement scores. Student prior achievement is based on the average of all non-missing test scores in the years t-2 and t-3. The sample includes all first-time students in these grades from 1993 to 1997. Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.


Table 7.1: OLS Estimates of ITBS and IGAP Math Achievement

                    Iowa Test of Basic Skills (High-Stakes Exam)         Illinois Goals Assessment Program (Low-Stakes Exam)
                    (1)          (2)          (3)          (4)           (5)          (6)           (7)          (8)

3rd Grade
  1997 Cohort       .146 (.012)  .039 (.016)   --           --           .184 (.012)  -.131 (.019)   --           --
  1998 Cohort       .264 (.013)  .162 (.024)  .249 (.014)  .151 (.024)   .272 (.015)  -.140 (.028)  .226 (.015)  -.174 (.029)
  R-Squared         .563         .564         .392         .392          .559          .569         .389          .401

6th Grade
  1997 Cohort       .091 (.008)  .027 (.012)   --           --           .159 (.009)  -.037 (.014)   --           --
  1998 Cohort       .194 (.010)  .175 (.018)  .178 (.010)  .157 (.018)   .231 (.011)   .009 (.022)  .217 (.012)  -.002 (.022)
  R-Squared         .757         .757         .660         .660          .718          .722         .646          .649

8th Grade
  1996 Cohort       .132 (.007)  .337 (.012)  .078 (.008)   --           .107 (.009)   .183 (.015)  .083 (.011)   --
  1997 Cohort       .153 (.008)  .493 (.020)   --           --           .171 (.010)   .297 (.025)   --           --
  1998 Cohort       .279 (.008)  .976 (.027)  .182 (.011)   --           .208 (.012)   .621 (.033)  .148 (.013)   --
  R-Squared         .764         .766         .767          --           .743          .743         .742          --

Allow for pre-existing trend:   No    Yes    No    Yes    No    Yes    No    Yes
Cohorts: columns (1), (2), (5) and (6) use the 1994-98 cohorts; columns (3), (4), (7) and (8) use the Form L cohorts (1994, 1996, 1998).

Notes: The sample includes all first-time students in these grades from 1994 to 1998. Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.


Table 7.2: Summary Statistics for Illinois Public Schools in 1990

                                  Chicago      Non-Urban      Other Urban
                                               Districts      Districts
                                  (1)          (2)            (3)
School Test Scores
  4th Grade Science               166          278            211
  7th Grade Science               182          279            205
  4th Grade Social Studies        163          279            212
  7th Grade Social Studies        185          277            207
  3rd Grade Math                  190          295            233
  6th Grade Math                  191          279            219
  8th Grade Math                  206          291            222
  3rd Grade Reading               157          275            205
  6th Grade Reading               184          285            221
  8th Grade Reading               208          278            220
  Average Test Score              190          286            221
  Average % Tested                82           90             85

School Demographics
  Average School Enrollment       795          487            506
  % Black                         55.7         6.0            45.8
  % Hispanic                      29.4         4.4            14.9
  % Asian                         3.0          3.0            1.1
  % Native American               0.2          0.1            0.1
  % LEP                           15.3         2.3            7.0
  % Free or Reduced Price Lunch   71.8         17.6           56.2
  % Avg. Daily Attendance         91.9         95.6           93.5
  % Mobility                      33.7         14.8           32.6

District Characteristics
  Pupil-Teacher Ratio             20.3         19.8           20.2
  Log(Avg. Teacher Salary)        10.7         10.5           10.5
  Log(Per Pupil Expenditures)     8.7          8.4            8.5
  Teacher Experience (years)      16.6         15.3           16.0
  % Teachers with B.A.            56.4         56.6           56.6
  % Teachers with M.A.+           42.9         43.4           211

  Number of Schools               470          3,055          242
  Number of Districts             1            833            34
  Average District Enrollment     306,500      3,769          7,089

Notes: The figures presented above are averages for all schools weighted by the school enrollment.


Table 7.3: OLS Estimates of IGAP Math Achievement in Illinois from 1993 to 1998

                          3rd Grade                  6th Grade                  8th Grade
                          (1)          (2)           (3)          (4)           (5)          (6)

Chicago*(1997-98)         13.4 (4.7)   -3.1 (3.5)     9.0 (2.6)   -2.8 (2.4)     3.6 (2.5)   -0.5 (3.3)
Urban*(1997-98)            4.1 (2.5)    0.6 (3.0)     3.3 (1.8)    8.3 (2.4)     4.9 (2.6)    6.5 (3.2)
1997-98                    3.8 (1.4)   -2.6 (1.4)     4.0 (1.1)   -6.8 (1.0)     7.1 (1.0)   -0.4 (1.0)
Chicago                  -19.4 (5.7)  -41.8 (7.8)   -12.4 (4.0)  -26.6 (5.6)    -2.2 (3.4)   -6.0 (5.5)
Urban                      7.0 (6.6)    2.3 (7.6)     5.2 (4.8)   13.1 (6.1)     9.8 (2.8)   12.3 (6.1)
Trend*Chicago               --          6.0 (1.1)      --          4.2 (0.8)      --          1.5 (1.2)
Trend*Urban                 --          1.3 (0.8)      --         -1.6 (0.8)      --         -0.5 (1.1)
Trend                       --          2.6 (0.6)      --          4.1 (0.4)      --          2.8 (0.4)
R-Squared                  .67                        .76                        .81
Number of Obs              13,751                     10,939                     8,199

Difference between Chicago and Other Urban Districts in Illinois
                          13.4         -3.1           9.0         -2.8           3.6         -0.5
Difference between Chicago and Non-Urban Districts in Illinois
                          17.5         -2.5          12.3          5.5           8.5          6.0

Notes: The following control variables are also included in the regressions shown above: percent black, percent Hispanic, percent Asian, percent Native American, percent low-income, percent Limited English Proficient, average daily attendance, mobility rate, school enrollment, pupil-teacher ratio, log(average teacher salary), log(per pupil expenditures), percent of teachers with a BA degree, and the percent of teachers with a MA degree or higher. Robust standard errors that account for correlation within schools across years are shown in parenthesis. The regressions are weighted by the inverse square of the number of students enrolled in the school.
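The Chicago*(1997-98) coefficient is a difference-in-differences interaction estimated on school-level observations, with the notes specifying weights equal to the inverse square of school enrollment. A minimal weighted-least-squares sketch on fabricated school-level data (all variable names and values below are hypothetical, not the paper's data or estimates):

```python
import numpy as np

rng = np.random.default_rng(1)

# Fabricated school-year observations (illustrative only).
n = 400
chicago = rng.integers(0, 2, size=n).astype(float)
post = rng.integers(0, 2, size=n).astype(float)      # 1997-98 indicator
enroll = rng.integers(200, 900, size=n).astype(float)
igap = (250.0 - 15.0 * chicago + 4.0 * post
        + 10.0 * chicago * post + rng.normal(0, 5, n))

X = np.column_stack([np.ones(n), chicago, post, chicago * post])
w = 1.0 / enroll**2  # weighting described in the table notes

# WLS: beta = (X' W X)^{-1} X' W y
beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * igap))
did_effect = beta[3]  # the Chicago*(1997-98) interaction
```

The interaction coefficient answers the question in the table's bottom panel: how much more did Chicago scores change after 1997 than scores elsewhere, net of the included controls.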


Table 7.4: OLS Estimates of IGAP Reading Achievement in Illinois from 1993 to 1998

                          3rd Grade                  6th Grade                  8th Grade
                          (1)          (2)           (3)          (4)           (5)          (6)

Chicago*(1997-98)         14.8 (4.2)   -5.9 (2.7)     7.9 (2.8)   -9.6 (3.5)     7.9 (2.2)    2.1 (4.2)
Urban*(1997-98)            5.1 (2.2)   -5.3 (2.5)     4.3 (2.2)    8.7 (3.4)     5.3 (2.2)    4.2 (3.6)
1997-98                   -6.4 (1.2)   -0.2 (1.1)   -21.6 (0.9)  -15.3 (1.1)   -20.9 (0.9)   -2.0 (0.9)
Chicago                  -12.5 (3.6)  -43.0 (7.4)     2.2 (3.3)  -25.1 (6.1)    14.6 (4.7)    1.3 (6.9)
Urban                      6.8 (4.9)   -8.8 (6.6)     8.1 (4.9)   14.4 (7.5)    11.2 (3.5)    8.8 (7.7)
Trend*Chicago               --          7.2 (1.1)      --          5.9 (1.1)      --          1.6 (1.5)
Trend*Urban                 --          3.6 (0.8)      --         -1.4 (1.1)      --          0.3 (1.4)
Trend                       --         -2.0 (0.4)      --         -2.3 (0.5)      --         -7.0 (0.3)
R-Squared                  .76                        .78                        .77
Number of Obs              13,749                     10,942                     8,200

Difference between Chicago and Other Urban Districts in Illinois
                          14.8         -5.9           7.9         -9.6           7.9          2.1
Difference between Chicago and Non-Urban Districts in Illinois
                          19.9        -11.2          12.2         -1.9          13.2          6.3

Notes: The following control variables are also included in the regressions shown above: percent black, percent Hispanic, percent Asian, percent Native American, percent low-income, percent Limited English Proficient, average daily attendance, mobility rate, school enrollment, pupil-teacher ratio, log(average teacher salary), log(per pupil expenditures), percent of teachers with a BA degree, and the percent of teachers with a MA degree or higher. Robust standard errors that account for correlation within schools across years are shown in parenthesis. The regressions are weighted by the inverse square of the number of students enrolled in the school.


Table 8.1: Estimates of the achievement gain associated with greater student guessing for students who scored in the bottom 10th percentile in prior grades

Columns:
(1) Number of additional questions answered in 1998
(2) Additional correct responses due to guessing in 1998; (2) = (1) * 0.25
(3) Achievement gain associated with an additional correct response
(4) Estimated gain associated with random guessing; (4) = (1) * 0.25 * (3)
(5) Actual gain
(6) Percent of gain attributable to increased guessing; (6) = (4) / (5)

                      (1)      (2)      (3)      (4)      (5)      (6)
3rd Grade  Reading     .34     0.09      .14     0.01      .06     0.20
           Math        .82     0.21      .07     0.01      .09     0.16
6th Grade  Reading     .46     0.12      .24     0.03      .29     0.10
           Math        .99     0.25      .08     0.02      .19     0.10
8th Grade  Reading     .96     0.24      .25     0.06      .36     0.17
           Math       1.63     0.41      .08     0.03      .27     0.12

Notes: The sample includes all first-time students in these grades in 1994 and 1998. Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.
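The 0.25 factor in columns (2) and (4) reflects the ITBS format (four answer choices per item, no penalty for wrong answers), so a blind guess is correct one time in four. A quick check of the column arithmetic for the 3rd-grade reading row:

```python
# Third-grade reading row of Table 8.1.
extra_answered = 0.34   # (1) additional questions answered in 1998
gain_per_item = 0.14    # (3) achievement gain per additional correct response
actual_gain = 0.06      # (5) actual gain

extra_correct = extra_answered * 0.25              # (2): one in four guesses is right
guessing_gain = extra_correct * gain_per_item      # (4)
share_from_guessing = guessing_gain / actual_gain  # (6), about 0.20 as reported
```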


Table 8.2: A Comparison of Eighth Grade ITBS and IGAP Exams

Math: ITBS
  Structure
  * 135 multiple-choice questions
  * 4 possible answers
  * No penalty for wrong answers
  * Five sessions of 20-45 minutes each
  Content
  * Computation (43)
  * Number Concepts (32)
  * Estimation (24)
  * Problem-Solving (20)
  * Data Analysis (16)
  Format
  * Computation problems do not have words.
  * Data Interpretation section consists of a graph or figure followed by several questions.

Math: IGAP
  Structure
  * 70 multiple-choice questions
  * 5 possible answers
  * No penalty for wrong answers
  * Two 40-minute sessions
  Content
  * Computation (10)
  * Ratios & Percentages (10)
  * Measurement (10)
  * Algebra (10)
  * Geometry (10)
  * Data Analysis (10)
  * Estimation (10)
  Format
  * All questions are written as word problems, including the computation problems.
  * One question per graph or figure.

Reading: ITBS
  Structure
  * 7 passages followed by 3-10 multiple-choice questions
  * 49 total questions
  Content
  * 2 narrative passages
  * 4 expository passages
  * 1 poetry passage
  Format
  * One correct answer per question.

Reading: IGAP
  Structure
  * 2 passages followed by 18 multiple-choice questions
  * 4 questions that ask the student to compare the two passages
  * 40 questions total
  Content
  * 1 narrative passage
  * 1 expository passage
  Format
  * Multiple correct answers per question.

Notes: Information on the ITBS is taken from the Form L exam. Information on the IGAP is based on practice books.
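Since neither exam penalizes wrong answers but the ITBS offers four choices per item and the IGAP five, blind guessing has a higher expected payoff per item on the ITBS. Using the eighth-grade math item counts from the table (the helper function name is just illustrative):

```python
def expected_correct_from_guessing(n_items, n_choices):
    """Expected number of correct answers from guessing uniformly at random."""
    return n_items / n_choices

itbs_math = expected_correct_from_guessing(135, 4)  # ITBS math: 4 choices per item
igap_math = expected_correct_from_guessing(70, 5)   # IGAP math: 5 choices per item
```

So a student who blindly fills in every ITBS math item expects 33.75 correct answers, versus 14 on the IGAP, one reason increased answering pays off more on the high-stakes exam.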


Table 8.3: OLS Estimates of the Relationship between Item Type and Achievement Gain on ITBS Math Exam from 1996 to 1998

Dependent Variable = Proportion of Students Answering the Item Correctly on the ITBS Math Exam

                                                                   (1)           (2)
1998 Cohort                                                        .023 (.011)   .015 (.012)
Basic Skills * 1998                                                .019 (.006)    --
Number Concepts * 1998                                              --           .019 (.010)
Estimation * 1998                                                   --           .006 (.010)
Data Analysis * 1998                                                --           .008 (.011)
Math Computation * 1998                                             --           .029 (.009)
25-35% answered item correctly prior to high-stakes testing*1998   .015 (.012)   .013 (.012)
35-45% answered item correctly prior to high-stakes testing*1998   .019 (.012)   .019 (.012)
45-55% answered item correctly prior to high-stakes testing*1998   .018 (.012)   .016 (.012)
55-65% answered item correctly prior to high-stakes testing*1998   .015 (.013)   .012 (.013)
65-75% answered item correctly prior to high-stakes testing*1998   .010 (.014)   .009 (.014)
75-85% answered item correctly prior to high-stakes testing*1998  -.002 (.014)  -.005 (.014)
85-100% answered item correctly prior to high-stakes testing*1998  .003 (.024)  -.003 (.024)

Number of Observations                                             692           692
R-Squared                                                          .956          .957

Notes: The sample consists of all tested and included students in 1996 and 1998. The units of observation are item*year proportions, reflecting the proportion of students answering the item correctly in that year. The omitted item category in column 1 is critical thinking skills. The omitted category in column 2 is problem-solving.


Table 8.4: OLS Estimates of the Relationship between Item Position and Achievement Gain on the ITBS Reading Exam from 1994 to 1998

Dependent Variable = Proportion of Students Answering the Item Correctly on the ITBS Reading Exam

                                                             Total
Intercept                                                    .004 (.021)
2nd Quintile of the Exam                                     .002 (.014)
3rd Quintile of the Exam                                     .013 (.015)
4th Quintile of the Exam                                     .017 (.015)
5th Quintile of the Exam                                     .027 (.017)
25-35% answered item correctly prior to high-stakes testing  .025 (.020)
35-45% answered item correctly prior to high-stakes testing  .036 (.019)
45-55% answered item correctly prior to high-stakes testing  .049 (.019)
55-65% answered item correctly prior to high-stakes testing  .046 (.021)
65-75% answered item correctly prior to high-stakes testing  .051 (.025)
75-100% answered item correctly prior to high-stakes testing .043 (.030)

Number of Observations                                       258
R-Squared                                                    .95

Notes: The sample consists of all tested and included students in 1994 and 1998. The units of observation are item*year proportions, reflecting the proportion of students answering the item correctly in that year. The omitted category is the first quintile of the exam.


Table 9.1: OLS Estimates of the Probability in the Following Year of Not Being Enrolled, Being Retained and Being Placed in Special Education

Dependent Variables:    Not Enrolled                  Retained

Kindergarten
  1997 Cohort          -0.003 (.002)   .010 (.003)   -0.001 (.001)   .001 (.002)
  1998 Cohort           0.000 (.002)   .019 (.003)    0.002 (.002)   .005 (.002)
  1999 Cohort           0.007 (.002)   .031 (.004)    0.008 (.003)   .011 (.003)

1st Grade
  1997 Cohort          -0.006 (.002)   .005 (.003)    0.013 (.003)   .013 (.003)
  1998 Cohort          -0.004 (.002)   .009 (.003)    0.020 (.003)   .021 (.004)
  1999 Cohort           0.003 (.002)   .022 (.004)    0.024 (.004)   .024 (.005)

2nd Grade
  1997 Cohort          -0.002 (.002)   .009 (.002)    0.023 (.003)   .018 (.003)
  1998 Cohort           0.001 (.002)   .017 (.003)    0.020 (.002)   .014 (.003)
  1999 Cohort           0.010 (.002)   .030 (.004)    0.017 (.002)   .008 (.003)

Trend                   No             Yes            No             Yes

Notes: The sample includes all first-time students in these grades from 1993 to 1999 (1993-98 for eighth grade). Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement along with up to three years of prior reading and math achievement (linear, square and cubic terms). Missing test scores are set to zero and a variable is included indicating the score is missing. Robust standard errors that account for the correlation of errors within school are presented in parentheses.


Table 9.2: OLS Estimates of the Relationship Between High-Stakes Testing and ITBS Scores

Dependent Variable = ITBS Score in:

                              4th Grade                              8th Grade
                 Math   Reading  Science  Soc. Stud.   Math   Reading  Science  Soc. Stud.
High-Stakes      .165    .156    -.023     -.061       .159    .128     .043     -.036
Regime          (.010)  (.010)   (.017)   (.016)      (.013)  (.017)   (.023)   (.023)

Notes: Each cell contains an estimate from a separate regression; the estimate reflects the coefficient on an indicator for whether the student was part of a post-policy cohort. The fourth-grade sample includes the 1996 and 1997 cohorts; the eighth-grade sample includes the 1996 and 1998 cohorts. Both samples are limited to students who were in the grade for the first time and were tested and included. Control variables include race, gender, race*gender interactions, age, household composition, and an indicator of previous special education placement, along with up to three years of prior reading and math achievement (linear, squared, and cubic terms). Missing test scores are set to zero and an indicator variable flags the missing score. Robust standard errors that account for the correlation of errors within schools are presented in parentheses.

Table 9.3: OLS Estimates of the Relationship Between High-Stakes Testing and IGAP Science and Social Studies Scores

                            Science                        Social Studies
                   4th Grade      7th Grade         4th Grade        7th Grade
                  (1)     (2)    (3)     (4)      (5)      (6)     (7)     (8)
Chicago*(1997)    6.2    -7.3   10.1     0.6     -1.3    -16.4     7.7    -5.0
                 (3.6)   (2.9)  (2.2)   (3.2)    (4.1)   (3.6)    (2.4)   (4.2)
Trend              No     Yes     No     Yes       No      Yes      No     Yes
R-Squared         .81     .81    .83     .84      .79      .79     .79     .79
Number of
Observations    11,105  11,105  6,832   6,832   11,103   11,103   6,834   6,834

Notes: The following control variables are also included in the regressions shown above: percent black, percent Hispanic, percent Asian, percent Native American, percent low-income, percent limited English proficient, average daily attendance, mobility rate, school enrollment, pupil-teacher ratio, log(average teacher salary), log(per-pupil expenditures), percent of teachers with a BA degree, and percent of teachers with an MA degree or higher. Robust standard errors that account for correlation within schools across years are shown in parentheses. The regressions are weighted by the inverse square of the number of students enrolled in the school.
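The weighted school-level regressions described in the notes can be implemented as weighted least squares, which amounts to rescaling each observation by the square root of its weight and running OLS. A minimal sketch, using synthetic data: the names `enroll` and `chicago_1997` are illustrative stand-ins, and the weight rule simply follows the notes' description.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares: scale each row by sqrt(weight), then OLS."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Synthetic school-level data (illustrative numbers only).
rng = np.random.default_rng(1)
n = 200
enroll = rng.integers(200, 1500, n).astype(float)
chicago_1997 = rng.integers(0, 2, n).astype(float)  # treatment indicator
y = 5.0 * chicago_1997 + rng.normal(0, 10, n)
X = np.column_stack([np.ones(n), chicago_1997])

# Weights per the table notes: inverse square of school enrollment.
beta_w = wls(X, y, 1.0 / enroll**2)
beta_ols = wls(X, y, np.ones(n))  # unit weights reduce WLS to plain OLS
```

With unit weights the function reproduces ordinary least squares exactly, which is a convenient sanity check on the implementation.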


Figure 6.1: Trends in ITBS Math Scores in Gate Grades in Chicago
[Line graph. X-axis: Years Before/After High-Stakes Testing (-6 to 5). Y-axis: Deviation from 1990 Mean (Adjusted Rasch Metric), -0.40 to 0.80. Series: 3rd Grade, 6th Grade, 8th Grade.]


Figure 6.2: Trends in ITBS Reading Scores in Gate Grades in Chicago
[Line graph. X-axis: Years Before/After High-Stakes Testing (-6 to 5). Y-axis: Deviation from 1990 Mean (Adjusted Rasch Metric), -0.20 to 0.50. Series: 3rd Grade, 6th Grade, 8th Grade.]


Figure 6.3: Trends in ITBS Math Scores at Different Points on the Ability Distribution
[Line graph. X-axis: 1994, 1996, 1998. Y-axis: ITBS Math Score (difference from 1994 mean), 0.00 to 0.50. Series: 10th, 25th, 75th, and 90th Percentiles.]


Figure 6.4: Trends in ITBS Reading Scores at Different Points on the Ability Distribution
[Line graph. X-axis: 1994, 1996, 1998. Y-axis: ITBS Reading Score (difference from 1994 mean), 0.00 to 0.50. Series: 10th, 25th, 75th, and 90th Percentiles.]


Figure 6.5: Trends in ITBS Math Achievement in Non-Gate Grades in Chicago
[Line graph. X-axis: Years Before/After High-Stakes Testing (-6 to 2). Y-axis: Deviation from 1990 Mean (Rasch Metric), -0.2 to 0.4. Series: 4th Grade, 5th Grade, 7th Grade.]


Figure 6.6: Trends in ITBS Reading Scores in Non-Gate Grades in Chicago
[Line graph. X-axis: Years Before/After High-Stakes Testing (-6 to 2). Y-axis: Deviation from 1990 Mean (Rasch Metric), -0.15 to 0.2. Series: 4th Grade, 5th Grade, 7th Grade.]


Figure 7.1: Trends in IGAP Math Scores in Gate Grades in Chicago
[Line graph. X-axis: Years Before/After High-Stakes Testing (-6 to 3). Y-axis: IGAP Score, 150 to 250. Series: 3rd Grade, 6th Grade, 8th Grade.]


Figure 7.2: Trends in Third Grade ITBS and IGAP Math Scores in Chicago
[Line graph. X-axis: Years Before/After High-Stakes Testing (-3 to 2). Y-axis: Average School Mean (1993 standard deviation units), -0.40 to 1.00. Series: IGAP, ITBS.]


Figure 7.3: Trends in Sixth Grade ITBS and IGAP Math Scores in Chicago
[Line graph. X-axis: Years Before/After High-Stakes Testing (-3 to 2). Y-axis: Average School Mean (1993 standard deviation units), -0.20 to 1.00. Series: IGAP, ITBS.]


Figure 7.4: Trends in Eighth Grade ITBS and IGAP Math Scores in Chicago
[Line graph. X-axis: Years Before/After High-Stakes Testing (-2 to 3). Y-axis: Average School Mean (1993 standard deviation units), -0.20 to 1.20. Series: IGAP, ITBS.]


Figure 7.5: Trends in the Difference Between Chicago and Other Urban Districts on the IGAP Math Exam
[Line graph. X-axis: Years Before/After High-Stakes Testing (-3 to 3). Y-axis: Chicago Mean IGAP - Urban Comparison District Mean IGAP, -50 to 0. Series: 3rd Grade, 6th Grade, 8th Grade.]


Figure 7.6: Trends in the Difference Between Chicago and Other Urban Districts on the IGAP Reading Exam
[Line graph. X-axis: Years Before/After High-Stakes Testing (-3 to 3). Y-axis: Chicago Mean IGAP - Urban Comparison District Mean IGAP, -60 to 10. Series: 3rd Grade, 6th Grade, 8th Grade.]


Figure 9.1: Trends in Fourth Grade ITBS Achievement in Chicago
[Line graph. X-axis: 1995 to 1997. Y-axis: Grade Equivalents, 3.00 to 4.60. Series: Math, Reading, Science, Social Studies.]


Figure 9.2: Trends in Eighth Grade ITBS Achievement in Chicago
[Line graph. X-axis: 1995 to 1998. Y-axis: Grade Equivalents, 6.5 to 8.5. Series: Math, Reading, Science, Social Studies.]


Figure 9.3: Trends in IGAP Social Studies Scores in Illinois (Difference between Chicago and Other Urban Districts)
[Line graph. X-axis: 1993 to 1997. Y-axis: Difference (Chicago - Other Urban Districts), -60 to 30. Series: 4th Grade, 7th Grade.]


Figure 9.4: Trends in IGAP Science Scores in Illinois (Difference between Chicago and Other Urban Districts)
[Line graph. X-axis: 1993 to 1997. Y-axis: Difference (Chicago - Other Urban Districts), -60 to 30. Series: 4th Grade, 7th Grade.]
