
EDUCATIONAL POLICY, Vol. 18 No. 1, January and March 2004, 45-70. DOI: 10.1177/0895904803260024. © 2004 Corwin Press.

Appropriate and Inappropriate Forms of Testing, Assessment, and Accountability

MINDY L. KORNHABER

Policy makers have focused on promoting test-based accountability systems as a tool for correcting a wide variety of educational problems, including low standards, weak motivation, poor curriculum and instruction, inadequate learning, and educational equity. This article argues that the appropriateness of testing, or any other form of assessment, as a solution to such problems should be guided by one primary motivation: whether it enables all students to function at the highest possible level in the wider world. This motivation provides a standard for evaluating the effectiveness of assessment policies. The article looks for evidence that test-based accountability systems, when applied to various educational problems, enhance or impede diverse students' ability to use their minds well. The evidence is, at best, mixed. This article argues for a more balanced use of assessments that can incorporate instructionally timely and useful information about students' performance and concludes with guidelines and recommendations for appropriate test use.

Keywords: assessment; accountability; testing; standards

RELIANCE ON STANDARDIZED TESTING for assessment and accountability has clearly gained favor with policy makers in the states and federal government. Nearly all states have a statewide accountability policy, and the great majority of these policies rely heavily on testing. About half of these peg test results to various kinds of high-stakes consequences for students and teachers (Amrein & Berliner, 2002; Council of Chief State School Officers, 2002). Under the No Child Left Behind Act of 2001 (NCLB), signed into law in January 2002, all schools must test students in Grades 3 through 8, and schools will be subject to various kinds of sanctions based on test scores.

Standardized testing has been widely implemented because it is perceived to address a broad array of problems, including low standards, weak student motivation, poor curriculum and instruction, inadequate learning, and educational inequality (Bishop & Mane, 2001; Murnane, 2000; Orfield & Kornhaber, 2001; Ravitch, 1996; Stotsky, 2000; Taylor, 2000). Despite their widespread adoption, the appropriateness of standardized testing policies remains widely debated by researchers. Whether standardized testing, or any other form of assessment, is an appropriate policy remedy depends on whether it minimizes or exacerbates each separate problem it is intended to address.

This article is organized into four sections. The first provides a telescoped account of the rise of current test-based accountability policies. The second addresses the purposes of educational assessment. The third explores whether various policy problems are being solved or exacerbated by test-based education reform. The article concludes with recommendations for appropriate use of testing and other assessments.

THE RISE OF TEST-BASED ACCOUNTABILITY POLICIES, STANDARDS, AND STAKES

The United States is now in its third decade of exam-driven education reform. These reforms were initially spurred by fears of declining economic competitiveness, low standards in schools, and weak student motivation (National Commission on Excellence in Education, 1983). Such concerns still animate calls for increased testing (Bishop & Mane, 2001; Murnane, 2000; Ravitch, 1996; Stotsky, 2000). Furthermore, they have helped move education policies, and assessment in particular, from tangential political and media priorities in the early 1980s (Dorn, 1998)—recall the Reagan administration's desire to abolish the Department of Education—to a central plank in the platforms of many state and national office seekers (Kornhaber & Orfield, 2001; see also Linn, Baker, & Betebenner, 2002). The political momentum behind testing knows no party boundaries at either the state or federal level (Kornhaber & Orfield, 2001; National Governors Association, 2001). Witness Arkansas's then-Governor, Bill Clinton, chairing the 1989 Education Summit. Out of that summit emerged President George H. W. Bush's education plan, America 2000, a key aim of which was to develop new tests tied to world-class standards (Goals 2000: Educate America Act of 1994). Clinton's own presidency was notable for its controversial, and ultimately unsuccessful, advocacy of a voluntary national exam in which states or districts could choose to test fourth-grade reading and eighth-grade math. Governor George W. Bush's 2000 presidential campaign put Texas's high-stakes testing front and center. After he won the White House, a modified Texas approach to accountability—including testing of all students in Grades 3 through 8—became the model for the entire nation under NCLB. Under NCLB, states will also have to participate in the National Assessment of Educational Progress's (NAEP) fourth- and eighth-grade assessments of reading and mathematics. Schools will have to show through testing that achievement is going up both for all students overall and for subgroups defined by color, disability, income, and English proficiency.

Given the perceived link between test scores and economic health, it is not surprising that business leaders have been very supportive of test-based reforms. It is true that those in business have long wanted schools to be held accountable for the use of tax resources and to prepare students for the workplace (Callahan, 1962; Tyack, 1974). However, business's push for using standardized tests to hold schools accountable is a more recent phenomenon (see Dorn, 1998). The rise of standardized testing across the states is in part a response to concerns by local economic actors—businesspeople and politicians—about their state's ability to retain and attract businesses. Such concerns mirror those voiced at the national level, harkening back to the 1983 federal report A Nation at Risk, which asserted that the economy is threatened by poor student and school performance (National Commission on Excellence in Education, 1983).

Alongside such concerns, test-based accountability accords well with standard business thinking. It is quantitative, emphasizes bottom-line figures, establishes clear targets en route to that bottom line, and relies on incentives or disincentives to meet those targets. The "new accountability" (Elmore, Fuhrman, & Abelmann, 1996) found across the majority of states incorporates this thinking with its "primary emphasis on measured student performance" and "the creation of systems of rewards and penalties" (Elmore et al., 1996, p. 5). Not surprisingly, test-based accountability models have received widespread support from the business sector. Business leaders, most notably Lou Gerstner, the longtime chairman of IBM, have taken active roles in promoting test-based systems (Gerstner, 2002). Business-based organizations such as the National Alliance of Business (2002) and the Business-Higher Education Forum (2002) have been established or rededicated to promote higher standards and test-based accountability systems (Platzer, Novak, & Kazmierczak, 2002). In fact, the origins of the national testing and accountability model embodied in NCLB might be found in H. Ross Perot, the self-made billionaire and former presidential candidate, who introduced these ideas into Texas's educational system in 1984 (Haney, 2001).

Even as standardized testing has become prominent on the political, business, and educational scenes, the standards against which students and schools are judged have become more demanding. In the late 1970s and early 1980s, the focus was on implementing tests of basic skills or "minimum competency tests" (Bishop & Mane, 2001; Dorn, 1998). A Nation at Risk (National Commission on Excellence in Education, 1983) criticized such tests for setting a low standard. Within 3 years of that report's publication, 35 states adopted more challenging standards, often accompanied by more challenging tests (Pipho, 1986). In the early 1990s, the ante was again raised with calls for "world-class standards" and tests that could measure students and schools against such standards (Ravitch, 1996; Resnick & Nolan, 1995).

The call for higher standards has made its way into NCLB. The law requires that, by the end of the 2013-2014 school year, all students score at or above the proficient level established by their state. What this may accomplish in reality is hard to grasp because states currently vary widely in what level of achievement counts as proficient and perhaps even more widely in the percentages of students who attain that designation (Linn et al., 2002). However, one way to gauge the push toward proficiency that NCLB demands is by considering its mandated point of reference for the statewide tests: the NAEP.

Recent NAEP mathematics results reveal that 26% of fourth graders and 27% of eighth graders are proficient or higher. In reading, 32% of fourth graders and 34% of eighth graders reach proficient or higher levels (National Center for Education Statistics, 2003). Consequently, NAEP's designation of proficient has been seen by some as too demanding (Hambleton et al., 2000; Linn et al., 2002). However that category is delineated, the requirement that schools will move all students toward proficiency, if proficiency is to retain anything resembling its meaning in common parlance, is a standard that is, at best, extremely challenging.
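
To make the scale of that demand concrete, consider a rough illustration (my own arithmetic, using assumed figures rather than data reported in this article): if a state's proficiency bar resembled NAEP's and roughly 30% of its students were proficient when NCLB took effect in 2002, reaching 100% by the end of the 2013-2014 school year would require an average gain of about

\[
\frac{100\% - 30\%}{12\ \text{years}} \approx 6\ \text{percentage points per year},
\]

sustained for a dozen consecutive years.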

Policy makers have raised not only the standards for acceptable test performance but also the consequences (or stakes) associated with test results. In the past, test scores were used in whatever way school and district bureaucrats felt reasonable; sometimes little attention at all was paid to them (Dorn, 1998). Such discretion has been lifted out of local decision makers' hands by state and national policy makers. In some 15 states, policy makers have demanded that test scores be used in considering whether to award a high school diploma; in North Carolina, Florida, Georgia, and Louisiana, school personnel must use test scores to decide whether to retain a student in grade (see Amrein & Berliner, 2002). The stakes have risen for schools as well as for students because test results are no longer simply matters for internal perusal but are now made publicly available in newspaper rankings and on many Web sites. Curious citizens can see how their school's scores compare to others or how their district stacks up. Under NCLB, scores will be used to determine whether adequate yearly progress has been met and what sort of consequences will ensue. For example, schools that receive Title I funding but fall short of adequate yearly progress for 2 consecutive years must provide students with other public school choices. Those that fall short for 3 consecutive years must provide students with additional services, such as tutoring. Such sanctions will further stretch the already-thin resources of schools whose populations include a large percentage of low-scoring students.

Even as calls for more testing and higher standards have broadened, the debate over types of assessment has narrowed. Educational assessment can take a wide variety of forms (see, e.g., American Educational Research Association [AERA], American Psychological Association, & National Council on Measurement in Education, 1999; Pellegrino, Chudowsky, & Glaser, 2002). These range from standardized tests to various forms of authentic assessment. The latter are so named because they are meant to mirror practices and performances that actually occur within a discipline rather than those found in the context of a test (Black & Wiliam, 1998; Wiggins, 1998). From the mid-1980s into the late 1990s, there was a lively conversation among education scholars, policy makers, and educators about the forms that assessment should take to improve learning (see, e.g., Darling-Hammond, Ancess, & Falk, 2001; Koretz, Klein, McCaffrey, & Stecher, 1994; Pick, 2000; Sizer, 1984; Wiggins, 1998; Zessoules & Gardner, 1991). This was accompanied by a great deal of experimentation among practitioners with portfolios, exhibitions, and other forms of authentic assessment.

Such conversations and experiments have been very largely subdued (Elmore, 2002). This may be tied partly to challenging technical hurdles: for example, it is much harder to achieve reliable scoring in authentic assessments (Koretz & Barron, 1998). It is also partly due to cost, which is substantially higher than that of machine-scored standardized tests. In addition, there are court decisions and political showdowns with powerful superintendents that favor standardized tests over other assessments. To illustrate, the New York Performance Standards Consortium, a network of more than 30 schools that has a strong track record of success in serving traditionally disadvantaged students, petitioned to be exempted from the state's Regents exams. The schools wanted to continue to rely on alternative assessments rather than adapt instruction and assessment to the extremely broad coverage demanded by the Regents. Nevertheless, state superintendent Rick Mills refused to exempt them (Winerip, 2003b), preferring to have uniform accountability for all schools across the state. Such difficulties have muted the public role and worth of authentic assessments. Thus, the reign of test-based accountability, high standards, and high stakes has now been consolidated.

PURPOSE OF ASSESSMENT

The political climate now favoring test-based assessment and accountability has obscured the nature of educational assessment. Educational assessment—evaluating what students know and can do—is undertaken for a great variety of useful reasons. Many of these reasons are spelled out in the "bible" of assessment, the Standards for Educational and Psychological Testing (Joint Standards) (AERA et al., 1999; see also Heubert & Hauser, 1999). Assessment is conducted to help determine students' existing knowledge base. It is used to diagnose disabilities that may be impeding an individual's learning. Assessment is employed to help award placements in programs of limited availability, such as programs for the gifted or governor's schools for science and math. It is used to certify mastery of a particular level of learning, as is the case with advanced placement tests.

Teachers conduct assessments in part to inform instruction: Essays that reveal little awareness of the objective case or a quiz evidencing scant knowledge of Newton's laws provide information to shape classroom instruction. Policy makers also use assessment to influence classroom practice. A mandated test of state history will spur teachers to emphasize that topic. Clearly, a prominent purpose of assessment in these times is to promote public accountability for educational effectiveness. That is, are teachers, administrators, schools, states, and districts using the tax dollars to produce learning?

Although this list of purposes is lengthy, it nevertheless leaves unarticulated what I argue should be educational assessment's fundamental and overarching motivation: Assessment should serve as a tool to enhance all students' knowledge, skills, and understanding so that they can function at the highest possible level in the wider world (Kornhaber, in press). This motivation for assessment extends beyond simply advancing learning, the laudable but abstract aim recently put forward by the Committee on the Foundations of Assessment of the National Research Council (Pellegrino et al., 2002).

If the aim of educational assessment stops at the point of "advancing learning," then the tendency will be to tether learning to results on readily available assessment tools. Given that testing has become the assessment tool of choice, advancing learning will too easily become equated with advancing test scores. Alas, as will be discussed below, there is no necessary relationship between higher test scores and the capacity to put the tested knowledge to use. Some researchers call this application of knowledge and skills "transfer" or "generalizability" (see Pellegrino et al., 2002; Perkins & Salomon, 1988; Salomon & Perkins, 1989). Others call it "understanding" (Gardner, 1999). Assessment should support this; it must enable students to function well not just on a test or in school but beyond those restricted spheres.

Articulating such a motivation cannot inoculate the educational process against inappropriate uses of assessment. However, it does set a clear standard against which to judge assessment policies, the procedures used in their implementation, and their effectiveness: If assessment policies and procedures work against enabling students to function at a high level in the real world, they should be deemed ineffective no matter what scores are attained. This motivation and the standard it sets are different from those now operative among state and federal policy makers, for whom attaining particular test results or a given adequate yearly progress has become the primary goal of both assessment and the larger education process.

IS TESTING SOLVING THE PROBLEMS IT IS INTENDED TO SOLVE?

The Underlying Theory of Test-Based Reform

Policy makers have adopted testing as a favored method of assessment in part because it is believed to constructively address a broad range of educational problems. These include low standards, weak motivation and expectations, and inadequate curriculum and instruction. Such problems do undermine the extent to which students can learn to function in the wider world. The literature on expertise reveals that high standards, high motivation and effort, and capable teaching are important elements in enabling high-level performance in a broad range of real-world disciplines (Chase & Simon, 1973; Chi, Glaser, & Farr, 1988, passim). Disparities in standards, expectations, and teaching clearly exist across races and classes and contribute to educational inequality and gaps in achievement (e.g., Baron, Tom, & Cooper, 1985; Delpit, 1995; Ferguson, 1998; Howard & Hammond, 1985).

The belief that testing can solve this range of very real problems rests on an often unarticulated, logical, and rather behaviorist theory. Under this theory, U.S. educational performance is suboptimal partly because standards are low and poorly articulated. Therefore, higher standards must be clearly and publicly spelled out. To find out whether students and educators are focusing on reaching these standards, students will be tested. To make sure the new standards and tests motivate teacher and student effort, test results will carry consequences or stakes. The consequences can be rewards (e.g., good publicity in the newspaper, bonuses for educators and schools) and/or punishments (e.g., bad publicity, dismissal of school staff, student retention or diploma denial). To avoid punishments and get rewards, students and teachers will work harder, and as a result, students will learn more and be better prepared for the workforce. Under NCLB, all states are now required to adhere to this theory. Otherwise, states themselves face a serious stake—forgoing federal tax dollars. The underlying logic of this theory is clear, but it is important to consider how well it plays out in practice.

The Problem of Low Standards

Tests can clearly influence educational standards. Basic skills tests are supposed to establish a floor, but they have acted much more as a ceiling for what will be taught and learned. The move to higher or world-class standards is aimed at markedly raising the standards for what is taught and learned (Bishop & Mane, 2001; Resnick & Nolan, 1995).

There are, however, several problems in using testing to push higher standards into schools. One of these is that when the standards are set far away from the school, they do not necessarily get into the classroom, or they do not get into the classroom in the expected ways. In many state systems, groups of educators and subject-matter experts are formed to forge higher standards. These may, in turn, draw on standards set by groups such as the National Council of Teachers of Mathematics and the National Council of Teachers of English. The higher standards states produce can be voluminous. For example, in Massachusetts the standards in one subject area alone can run well beyond 100 pages (Massachusetts Department of Education, 2003). These new standards are then disseminated to districts and schools.

Yet this does not mean districts and schools "get" the standards. In a study of the uptake of standards by educators in schools, Blank, Porter, and Smithson (2001) found that the standards are often not adopted and used in the classroom. It is unclear why the standards are slow to seep in. Part of the reason may be that the standards are, in some cases, too detailed and lengthy for educators to use. Human minds, especially those juggling the diverse tasks present in most classrooms, seek far fewer, manageable chunks of information (Miller, 1956). Another element may be that standards set far outside the classroom by committees of teachers and subject-matter experts may not mesh with the local concerns of schools and teachers (Meier, 2000). In addition, it is possible that standards are slow to kick in because, under the precedent set by Debra P. v. Turlington (1979), the consequences associated with test scores can take effect only several years after the introduction of new tests.1 This lag between testing and its consequences may delay the classroom adoption of the standards that the tests are supposed to assess. If standards are not being adequately adopted at the classroom level, it is highly unlikely that students are systematically learning the standards, despite what the theory of test-based reform sets forth.

The Problem of Weak Student Motivation

By clearly spelling out what it is that all students must learn, and distributing consequences to students depending on their success or failure in learning it, test-based reform is supposed to motivate students to work harder. Nevertheless, the uniformity of demands and consequences will not equally motivate all students. Students, both as individuals and as identifiable subpopulations, have different motivations to work hard in school.

Some researchers have argued that African American, Latino, and Native American students show differences from White middle-class students in their motivation for academic work (Ferguson, 1998; Garrison, 1989; Mickelson, 1990; Ogbu, 1978, 1991; Valenzuela, 1999). Ogbu (1978, 1991) has found that students from immigrant groups that have voluntarily migrated to a new country in search of a better life tend to have higher academic achievement relative to their majority group. Such "voluntary minorities" are motivated to work hard in school because they believe that academic achievement will enable them to overcome social barriers and become successful. However, historically it has been the case that students from groups that have been enslaved or otherwise assigned to the bottom of society have not found school achievement linked to social success. These "involuntary minorities" tend to have reduced motivation to work hard in school and have reduced achievement.

In addition to this dynamic, Fordham and Ogbu (1986) asserted that in response to oppression, African Americans have developed a collective identity and culture. To maintain one's standing within this collective culture, some African Americans will reject activities, including schoolwork, that they regard as characteristic of the White majority. Schooling can sometimes be viewed by students from involuntary minority groups as a "subtractive process" that effaces their identity and culture (Fordham & Ogbu, 1986; Garrison, 1989; Valenzuela, 1999). Achievement in school is then seen as threatening their social ties with others in their group. This, in turn, reduces motivation and achievement.

Ogbu's position has not gone unchallenged. Some empirical investigations report no differences in motivation or effort among African American students (Spencer, Noll, Stoltzfus, & Harpalani, 2001) and find that, controlling for social background variables, successful Black students feel no more socially alienated than White students; they also feel more popular than less successful Black students (Cook & Ludwig, 1998). Ogbu's work has also been critiqued in part for not recognizing that African Americans have historically prized education (Cross, 1995). In addition, some believe that Fordham and Ogbu's (1986) findings primarily reflect the results of extreme economic and social segregation of some African Americans in the inner city (Cross, 1995). Nevertheless, more recent research by Ferguson (2002) that focuses on African American students in integrated schools lends some support to Fordham and Ogbu's conclusions and finds the residue of oppositional culture operative even among African American students living in suburbia (see Winerip, 2003a).

Oppositional culture is not the only explanation for depressed motivation and lower achievement. However, to the extent that it is operative, test-based policies offer no solution to the problem. Other sorts of solutions for low motivation have been suggested by Ogbu and his colleagues. The first of these is to reduce economic disparities that lead some minority students to disparage the utility of academic achievement (Fordham & Ogbu, 1986). This suggestion is supported by evidence that the largest reduction in the Black-White achievement gap came during the period of greatest enforcement of civil rights laws (Levin, 2001; see Jencks & Phillips, 1998). In addition, many other researchers have argued that teachers in culturally diverse classrooms may not succeed by simply dispensing skills (e.g., Banks, 1994a, 1994b, 2001; Delpit, 1995; Ladson-Billings, 2001a, 2001b; Valenzuela, 1999). Knowledge and skills are more likely to make their way into the minds and hearts of students on the backs of interactions with teachers who understand, and can work with, cultural differences. Testing will likely influence some students, especially those not already disenfranchised and those already close to the passing grade. For others, it may not spur a level of commitment needed to succeed on the exam.

The Problem of Poor Curriculum and Instruction

Proponents of high-stakes testing claim that the new state and federal policies will bring a renewed focus on substantive curriculum. This focus, they assert, has been especially weak in schools that serve primarily African Americans, Latinos, or students from poverty. Such students have not had adequate access to challenging curriculum (Bishop & Mane, 2001; Ravitch, 1996; Stotsky, 2000). Testing advocates believe tests will bring about more equal curriculum by making educators across the state teach what is going to be on the test. Because the tests draw on higher standards, they will also be "worth teaching to" (Simmons & Resnick, 1993; Spalding, 2000; Viadero, 1994).


In theory, all of this makes sense. However, it does not always play out quite so sensibly in practice. For example, teaching to the test sometimes yields an overemphasis on rote instruction. An investigation into Chicago's test-driven school reform found that "the current school system leadership has strongly emphasized drill in preparation for the test. . . . Furthermore, the curriculum options that they propose to mandate for low achieving schools place an overriding emphasis on teacher-directed drill" (Moore & Hanson, 2001, p. 6). An overemphasis on drill is known to reduce students' ability to use knowledge outside the context of a given school or exam setting (Perkins & Salomon, 1988; Salomon & Perkins, 1989).

An increased reliance on drill and rote instruction among low-achieving schools is not unique to Chicago. Other researchers have found this pattern in large districts in Texas, New Jersey, and other states (Firestone, Camilli, Yurecko, Monfils, & Mayrowetz, 2000; Hillocks, 2002; McNeil, 2000; McNeil & Valenzuela, 2001). Low-achieving schools are typically those with high concentrations of students from poverty and high concentrations of Latino and African American students. Rather than making rich curriculum more widely accessible to such students, it appears that test policies may narrow the curriculum that is offered to them. McNeil and Valenzuela (2001) found that in several schools serving poor and minority students in Texas, students learned to write only the same kind of essay that the state test required. They were asked to focus their reading on short passages that paralleled the format of the state's test. Over time, students in these schools became less able to read extended works of literature. Content was also lost in the subject areas that were not tested because time was taken from teaching those subjects to teach more of the subjects that the state tested (McNeil & Valenzuela, 2001).

Several investigators have found a more positive picture. For example, Bishop and Mane (2001) found that schools in New York State that began voluntarily implementing universal requirements that students take the Regents exam generated greater teacher effort and student learning. Skrla, Scheurich, and Johnson (2000) argued that once several districts in Texas began focusing on the Texas Assessment of Academic Skills Test (TAAS), they mustered a range of supports that enabled a variety of achievement gains. Part of the debate among researchers over the impact of test-focused curriculum is due to differences in the studies' sampling, methodology, and choice of outcome variables. For example, Bishop and Mane found that graduates of test-focused schools in New York experienced statistically significant increases in earnings. However, the gain in earnings still left students just barely above the poverty line (Kornhaber & Orfield, 2001). A boost in income to that level supports neither the idea that the tests induced better curriculum and instruction nor the idea that the tests provided a worthwhile system of educational assessment: one that enabled students to use their minds well in the wider world.

The Problem of Inadequate Learning

Even if testing policies narrow the curriculum and steer instruction toward drill, testing proponents argue that students, especially those in high-poverty schools, still need to learn the knowledge and skills the tests sample. However, curricula that focus heavily on test preparation often do not foster students' learning. That is, such reforms may not further their ability to actually use English or apply mathematical concepts and skills (Elmore & Rothman, 1998; Klein, Hamilton, McCaffrey, & Stecher, 2000; Koretz & Barron, 1998; Linn, 2000; Neill, 2001; Pellegrino et al., 2002). This phenomenon may be hard for policy makers to recognize, because test scores typically rise after the introduction of a new testing program (Linn, 2000). Such increases may give the appearance that new testing policies are advancing learning.

However, test scores can rise for a variety of reasons unrelated to learning. Many of these reasons fall into the category of "gaming." That is, scores can be improved by manipulating conditions unrelated to teaching and learning what makes for good work in a discipline. Gaming techniques include changing the pool of test takers to weed out those who are struggling. NCLB is responding to this problem by requiring high percentages of test participation. However, one way the pool is gamed entails retaining students in grade so that they are not in the testing pool with their age cohort and have more time to prepare for the exam. Therefore, in some states, including Texas and Massachusetts, there is a sizable bump in the percentage of students who are retained the year before the states' high-stakes exams are given (see, e.g., Haney, 2000, 2002).

Unfortunately, the extra year that students are given to "catch up" is associated with increased drop-out rates but not with increased learning (Hauser, 2001; Holmes, 1989; Moore, 2000). Texas, which has become renowned for its improved test scores, is also home to several districts that have among the highest drop-out rates in the entire country (Haney, 2001). Houston Independent School District, which had been heralded as a model of urban education, was recently shown to have drop-out rates close to 40%—not in the low single digits as had been claimed (Schemo, 2003). Houston appears to have gamed these figures in part by following then-Superintendent Rod Paige's instruction to reevaluate what was classified as dropping out (Peabody, 2003). A similar situation obtains in New York City's schools, where efforts to meet accountability ratings are undermining students' chances of receiving a high school diploma:

As students are being spurred to new levels of academic achievement and required to pass stringent Regents exams to get their high school diplomas, many schools are trying to get rid of those who may tarnish the schools' statistics by failing to graduate on time. Even though state law gives students the right to stay in high school until they are 21, many students are being counseled, or even forced, to leave long before then. (Lewin & Medina, 2003, p. 1)

In addition to altering the pool of test takers, gaming techniques include teaching test-taking skills, such as how to bubble in an answer sheet as quickly as possible or how to eliminate choices. Both of these may be useful on a test but have exceedingly little to do with actually learning disciplinary skills or content. It goes without saying that there is also the problem of gaming through outright cheating. In essence, under test-driven systems, practice may be inappropriately warped to make the scores look good whether or not the underlying teaching and learning are good. Several leading researchers in testing and assessment have argued that the higher the stakes associated with the test, the more likely it is that the test will be gamed (Cizek, 1998; Haney, 2000, 2001; Linn, 2000; Madaus & Clarke, 2001).

To know whether students are learning, it is important to ascertain whether they can apply the knowledge and skills that are called for on one test to some other context. Ideally, educational assessment should enable students to function at a high level beyond the contexts of tests and schools. However, the usual standard that educational researchers use for ascertaining learning is more basic. Typically, researchers are satisfied if students can apply knowledge and skills by showing that scores also increase on a different test of the same content. For example, do math scores go up both on a state's test and on the math portion of the NAEP?

The research that has investigated how well students' gains on tests generalize to other tests of the same content does not lend much support to the idea that test-intensive reforms are promoting learning. An investigation by Neill (2001) uncovered no clear association between gains on the NAEP and the presence or absence of a state-mandated high school graduation test. States that relied on high-stakes testing did not necessarily show NAEP gains, even when scores rose on their own state tests. Koretz and Barron's (1998) research on Kentucky's Instructional Results Information System Test (KIRIS) revealed that the improvement on the state's own test in the early 1990s was from 3.5 to 4 times greater than its improvement on NAEP. This held true even though KIRIS was supposed to assess the same content and skills found on NAEP. Koretz and Barron also compared gains at the high school level with gains on the ACT and once again found much larger gains on the KIRIS. Amrein and Berliner (2002) investigated the 18 states with the highest-stakes testing programs and found only scant indications of generalized gains in learning on the NAEP, college entrance examinations, or advanced placement tests.

Texas's system of testing, now the model for the rest of the nation, has received much attention for its test-score gains. But evidence that an educational miracle has unfolded there may rest more in the minds of the reform's believers than in the minds of the state's students. Among the believers' evidence is a RAND Corporation study in which Texas and North Carolina are highlighted for manifesting the largest average NAEP gains between 1990 and 1996. In this study, Texas's African American and Latino students had score increases that were larger than those of the state's White students (Grissmer, Flanagan, Kawata, & Williamson, 2000). Yet another RAND team studied Texas's fourth- and eighth-grade mathematics and reading results on NAEP and found that Texas outranked other states only on fourth-grade math. In eighth-grade math and in fourth- and eighth-grade reading, "the gains in Texas were comparable to those experienced nationwide" (Klein et al., 2000, p. 12). The much-heralded narrowing of the score gap on Texas's own state test was also called into question. These researchers discovered that between 1994 and 1998, the Black-White gap in Texas's NAEP scores actually increased in fourth-grade reading and in fourth- and eighth-grade math. The NAEP score gap between Latinos and Whites also grew slightly wider. These gaps grew even as Texas declared gaps to be narrowing on its own state assessment (Klein et al., 2000).

If students can barely apply knowledge from one test to another test of the same material, it is highly unlikely that they can actually use this knowledge in some real-world context, such as work. Such findings undermine one key argument for increased state testing: to prepare a better educated workforce. It is even possible that testing will make it more difficult for students to take knowledge gained in school and use it in other contexts. Transferring, or generalizing, knowledge from a given classroom or test to other contexts requires that students have opportunities to use what they learn in a variety of ways (Gardner, 1992; Pellegrino et al., 2002; Perkins & Salomon, 1988; Salomon & Perkins, 1989). Because of the serious consequences that are built into many state testing policies and into NCLB, teachers and students—especially those in high-poverty and high-minority schools—will focus on using knowledge primarily for just one context: the test. Test-driven reforms will naturally tend to reduce teachers' efforts to provide students with a range of arenas in which to apply what they have been taught.

In sum, the theory behind test-driven reforms, described above, is persuasive and politically powerful. However, testing systems, especially high-stakes ones, are extremely unlikely to provide the nation with answers to many of its educational concerns. These systems appear inadequate for bringing in high standards of disciplinary work, in part because such standards are too voluminous to be sampled by a test or used by teachers. Instead, educators' rational choice is to teach to the demands of the test rather than the demands of the academic disciplines. Furthermore, teaching that overemphasizes test improvement is not likely to improve learning because students will not ultimately be well equipped to transfer what they have learned beyond the test. Nor will test-driven systems undo inequity in education: The more students are at risk of scoring poorly, the greater will be their exposure to excessive test preparation and drill (McNeil & Valenzuela, 2001) and the less likely they will be to apply what they are being taught.

Testing is also inadequate to address motivational issues, except perhaps among students who are closest to attaining passing performances and those who are not disenfranchised from White middle-class aspirations. This leaves out students who are most at risk. Those students' scores may even become a liability in the accountability system because they will make schools' scores look bad (Haney, 2000, 2001; Lewin & Medina, 2003). Therefore, another rational, if unintended, consequence of test-based accountability will be to retain such students or push them out, a practice counter to advancing their learning. Given all this, it is also highly unlikely that test-focused reforms will yield increased learning in general or increased equity in the ability of diverse students to use their minds well (Kornhaber, in press).

Standardized testing does have strengths: It is less expensive per pupil than other forms of assessment and other kinds of school reforms (including reducing class size or educating and retaining excellent teachers). Testing is designed to yield reliable results, a necessary but insufficient basis for making valid inferences about educational performance. It can generate a summary statistic that facilitates a range of comparisons across students and schools. It can wield considerable influence across the educational system, creating a sense that politicians are addressing educational needs (Natriello & Pallas, 2001). However, policy makers' reliance on standardized testing is an inadequate solution to the key educational problems to which it is now being applied.

GUIDELINES AND RECOMMENDATIONS FOR APPROPRIATE TEST USE

To the extent that NCLB remains in effect, testing will play a prominent role in state and federal educational policy. Given this, it is important to make sure that testing is appropriately used. In the absence of appropriate use, it is unreasonable to use test scores to make inferences about the performances of students or the effectiveness of schools or policies. Appropriate educational assessment rests on several practices and procedures that have been spelled out in the Joint Standards (AERA et al., 1999) and underscored by the National Research Council (Heubert & Hauser, 1999). These practices include using more than a single test result in decision making, examining whether there has been equitable treatment in testing, and ascertaining whether there has been a reasonable opportunity to learn.

Reliance on More Than a Single Test Score

One of the key recommendations for the appropriate use of testing is that no one test score should serve as the sole basis for making decisions of consequence (AERA et al., 1999; Heubert & Hauser, 1999). In other words, the decision to retain students, assign them to remedial classes, deny students their high school diploma, withhold school funds, or reconstitute a school staff should not be based on the results of a single test. No test is a perfectly reliable indicator of a test taker's knowledge and skill. Despite a great deal of effort by test developers, all tests are subject to error of measurement.

Because testing is imperfect, districts and states generally allow students to retake high-stakes tests that they have failed. From a psychometric perspective, educational assessment based on multiple tests reduces measurement error and therefore provides a better basis for decision making than does a single test (Gong & Hill, 2001). However, from an instructional vantage point, repeated testing is problematic. It is likely to steer teaching and learning toward drill rather than enabling students to use their minds well in the wider world.
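
The psychometric point can be made concrete with a standard classical test theory sketch (a textbook formulation, not drawn from the sources cited above). An observed score X is modeled as a true score T plus random error E; the standard error of measurement (SEM) depends on the score spread and the test's reliability, and averaging k parallel measures with independent errors shrinks it:

\[
X = T + E, \qquad \mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}}, \qquad \mathrm{SEM}_{\bar{X}} = \frac{\mathrm{SEM}}{\sqrt{k}}.
\]

For example, a test with an SEM of 3 scale points yields an error band of roughly 1.5 points when four parallel administrations are averaged, whereas a single retake narrows the uncertainty only modestly, quite apart from the instructional costs noted above.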

Although test-driven education reform has proved powerful to policy makers, authentic assessment offers another equally logical approach to introducing standards, motivating students, and improving curriculum, instruction, and learning. In a system of authentic assessment, students acquire high standards by studying examples of high-quality work in different disciplines and the practices that support it. The standards need not be set by some distant committee but can be driven by works or issues that are compelling within particular communities. This makes it much more likely that the standards will become known and "owned" by students and teachers. The standards are also known because students and teachers clearly articulate the characteristics that make for high quality in the works they study. For example, the characteristics for a strong essay might include a compelling opening thesis, a solid presentation of positions for and against the thesis, a strong organization, and highly polished use of language. Students then engage in ongoing efforts to produce work that embodies these characteristics. In the process, they receive regular and formative feedback from teachers and classmates.


This feedback can be much richer than test-score results and much more useful for inducing learning (cf. "Your opening thesis is strong, but you now have little in the way of counterargument" vs. "81st percentile" or "basic"). Unlike standardized testing, the feedback in authentic assessment is also delivered while students are doing the work or very soon thereafter. This enables students to use the feedback to improve their performance.

Authentic assessment has been widely and successfully used in individual schools and classrooms (Black & Wiliam, 1998; Darling-Hammond, 2001; Wiggins, 1998). However, like standardized tests, authentic assessment also has weak points. For example, authentic assessment takes much more time. This makes it a more costly and less efficient source of data gathering. Reliability in scoring is harder to achieve (Haertel, 1999; Koretz et al., 1994). Because the standards are more locally developed, there will be considerable variation in "quality work." Some standards and work will fall markedly short, but unlike in test-based accountability systems, low-level work may be less readily detected. This is partly because, unlike testing, such assessments do not readily yield a summary statistic that can be easily used for comparative and analytic purposes. In essence, authentic assessment is mismatched to the demands of systemwide accountability, which is one reason why state systems are dominated by testing.

Clearly, the strengths and weaknesses of performance-based assessment and test-driven systems are complementary. Given the weaknesses inherent in either system of assessment, a rigid adherence to one form or the other is problematic. Our current policies have relied too heavily on testing, whose downsides are becoming increasingly evident as drop-out and retention figures emerge. There is now a need for a pendulum correction. Standardized testing can be used to audit, but it is not well matched to the tasks of teaching and learning. Authentic assessment cannot be readily used to audit, but it can enhance teaching and learning (Black & Wiliam, 1998; Eisner, 1999; Pellegrino et al., 2002; Wiggins, 1998). Using both forms of assessment, rather than sticking to the repeated use of one form, would more constructively follow the spirit of the Joint Standards' guideline to use more than one test in decision making. In addition, greater incorporation of authentic assessment into state policies should facilitate knowledge and skill development. Oddly enough, this might then create more test-based evidence of learning.

Equitable Treatment in the Testing Process

Another crucial element in appropriate test use is equitable treatment in the testing process. In part, this means that testing conditions ought to be comparable across individuals and also across groups of test takers. Thus, tests are standardized with regard to materials, time, and instructions. With standardized conditions and procedures in place, it is more reasonable to infer that a test score reflects what a student knows and can do rather than random events.

Despite efforts to provide equitable treatment, standardization can be somewhat elusive. Kane and Staiger (2001) pointed out that testing situations differ due to distractions such as disruptive students, dogs barking, and other events beyond the control of test companies. In addition, there are many situations that demand a relaxation or elimination of standardized procedures. For example, students who have learning disabilities are legally entitled to receive modifications of the standard procedures. Thus, a blind student should be provided with a reader or Braille materials. These modifications are intended to improve equitable treatment by enabling students—despite their disabilities—to demonstrate their command of the underlying concepts.

Under some recent state policies and under NCLB, students with very serious learning differences/disabilities are required to be tested. Yet, the extent of modifications that may be needed for a student with severe disabilities may impede the chances that reasonable inferences can be made from test scores about what students know and can do (McDonnell, McLaughlin, & Morison, 1997). A policy of testing all students, despite their disability, may be aimed at providing equity, but it collides with one of the key technical claims of testing: providing a basis for valid inferences. Because valid inference making is difficult when disabilities are great and because testing may be inappropriate in some instances, students with serious disabilities and their parents should be given a great deal of leeway regarding test participation.

Students who do not have a good command of English are also increasingly being required to participate in state testing. Again, it is believed that these students are all too easily marginalized if they are not included in the accountability system. Yet, as a National Research Council report noted, "The demands that full participation of English-language learners make on assessment systems are greater than current knowledge and technology can support" (Heubert & Hauser, 1999, p. 296). It is exceedingly difficult to make reasonable inferences about what a student knows and can do with concepts in mathematics, science, or writing if that student is asked to perform in a language in which he or she is not yet fully competent. These students should also be afforded accommodations to the standardized procedure, such as tests in their own language or tests that reduce the English-language demands (Heubert & Hauser, 1999).

Equitable treatment extends beyond issues of how tests are administered. It includes the notion that students should have equal access to any materials that test companies or states distribute to help students prepare for the test (AERA et al., 1999). The idea of equal access becomes complicated when we leave the narrow confines of instruction booklets and pamphlets. For example, testing companies and state agencies provide much information online; this is less available to those without home computers. If we move to preparatory experiences, rather than preparatory materials, there are clearly enormous differences in the access afforded to students situated in different communities. Some students are much more likely to have individual tutors; others will be more likely to experience low-level in-school remediation. These differences are not the responsibility of testing companies. However, test users and those who interpret test scores should acknowledge that inequitable access exists and may influence students' scores, even when students are sitting for the same standardized test.

Opportunity to Learn

Differences in test preparation are but one piece of the substantial inequalities that exist in students' opportunity to learn. The issue of opportunity to learn is one that should be central to how scores are interpreted (AERA et al., 1999): A given

test score may accurately reflect what the test taker knows and can do, but low scores may have resulted in part from not having had the opportunity to learn the material tested as well as from having had the opportunity and having failed to learn. (AERA et al., 1999, p. 76)

Understanding and reporting which of these cases applies should be very much a part of appropriate test use, especially when serious consequences are at stake. It is one thing for students not to have learned materials because they have put in too little effort; it is quite another if the school system has failed to provide students with the necessary opportunities to learn.

Part of the problem in using opportunity to learn, according to the Joint Standards, is that no one agreed-on definition exists for what constitutes a reasonable opportunity. However, I would argue, in terms that parallel Justice Stewart's comment on pornography, that one knows meaningful differences in opportunity to learn when one sees them. Yet many policy makers turn a blind eye to these differences, preferring to blanket them in a thick cover of equality in testing.

GI Forum, Image de Tejas v. Texas Education Agency (2000), brought on behalf of minority students against Texas's TAAS exam, provides an excellent example.2 Judge Prado argued that though many aspects of Texas's educational system were unequal, the rigid alignment that the TAAS exam had created between test and curriculum served to make those differences less important (GI Forum, Image de Tejas v. Texas Education Agency, 2000). Unfortunately, rigid alignment between test and curriculum cannot correct many disparities known to affect opportunity to learn. These include differences in teacher quality, concentrations of poverty, class size, bilingual status, and housing policies (Darling-Hammond & Sykes, 2003; Finn, Gerber, Achilles, & Boyd-Zaharias, 2001; Firestone et al., 2000; Gaudet, 2000; Natriello & Pallas, 2001; Orfield, 1993, 2001; Orfield & Yun, 1999). In fact, there are indications that testing policies may drive good teachers away from schools with largely disadvantaged populations (Darling-Hammond & Sykes, 2003).

A recent National Research Council report suggests that "one way of addressing fairness in assessment is to take into account examinees' histories of instruction—or opportunities to learn the material being tested—when designing assessments and interpreting students' responses" (Pellegrino et al., 2002, p. 7). Conditional inferences would, in theory, offer some hope of reducing high-stakes consequences among students who have not had a reasonable opportunity to learn. However, the likelihood of implementing a program of conditional inferences, in which, for example, students' test performances were interpreted in light of relevant background information, is low at this time. The National Research Council report noted that there are technical obstacles to making such inferences and that efforts to do so systematically have been carried out only on a small scale (Pellegrino et al., 2002). Financing such approaches presents another obstacle, especially in light of current budget deficits at both the state and federal levels. More serious are fundamental philosophical issues that intersect with the debates over affirmative action. Why should some students have a conditional score that passes them when others' higher conditional scores are failures?
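
To illustrate what such a conditional inference might involve (a schematic of my own, not a model proposed in the report), a student's score could be interpreted relative to what students with comparable instructional histories typically achieve:

\[
\tilde{y}_i = y_i - \mathbb{E}\left[\, y \mid \mathrm{OTL}_i \,\right],
\]

where \(y_i\) is student \(i\)'s observed score and \(\mathrm{OTL}_i\) stands for measured opportunity-to-learn variables (for example, courses offered or teacher qualifications, both hypothetical choices here). A positive \(\tilde{y}_i\) would indicate performance above what comparable instructional histories predict. The closing question above shows why even this simple adjustment is contentious: two students with identical raw scores could receive different adjusted interpretations.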

Given the hurdles of conditional inference making, we are left with a single bar that students from every sort of advantage and disadvantage are equally expected, if not equally supported, to surmount. Increasingly, we are relying on standardized testing both to establish the bar and to serve as the primary engine for getting students over it. Our nation's policy makers are looking to testing to solve problems of low standards, weak motivation, inadequate instruction and curriculum, and inequity. It would be convenient if one policy tool could accomplish all these goals. (It would be equally convenient, and equally misguided, to imagine that other complex systems could be globally improved by relying heavily on a single policy tool, say, jail building for reducing crime or tax reductions for our economic woes.)

A one-tool solution makes for good sound bites, but few complex social issues will actually be resolved by the overuse of a single approach.


Therefore, we must build broader and more thoughtful policies. These policies should embrace varied forms of educational assessment. They should also address other educational issues beyond assessment, such as the inadequate distribution of high-quality teachers and other fundamental resources for learning. In addition, we should redirect policy makers' attention to social issues that have fallen out of favor, including income supports, housing, health care, and employment, for these issues do much to shape the too-familiar patterns found among those who succeed in America's educational system and those who do not.

NOTES

1. In Debra P. v. Turlington (1979), lawyers for the plaintiffs (several Black students in Florida) argued in district court that Florida's minimum competency exam, the SSAT-II, violated their clients' rights to due process and equal protection under the 14th Amendment of the Constitution. The plaintiffs' case was upheld. The case helped establish a precedent that test scores cannot be used to deny diplomas unless students and educators have adequate time to align teaching and learning to the test.

2. In GI Forum, Image de Tejas v. Texas Education Agency (2000), lawyers for the plaintiffs (minority students in Texas) sued the Texas Education Agency and related state actors in district court. The plaintiffs charged that the use of the TAAS exam as a high school graduation requirement violated students' right to due process under the 14th Amendment of the Constitution and that the TAAS test had a discriminatory impact, thus violating Title VI of the Civil Rights Act of 1964. The plaintiffs were not successful, and judgment was granted in favor of the defendants.

REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Amrein, A. L., & Berliner, D. C. (2002). High-stakes testing, uncertainty, and student learning. Education Policy Analysis Archives, 10(18). Retrieved from http://epaa.asu.edu/epaa/v10n18/

Banks, J. A. (1994a). Multicultural education: An introduction. Needham Heights, MA: Allyn & Bacon.

Banks, J. A. (1994b). Multiethnic education: Theory and practice (3rd ed.). Needham Heights, MA: Allyn & Bacon.

Banks, J. A. (2001). Multicultural education: Historical development, dimensions, and practice. In J. A. Banks & C. A. M. Banks (Eds.), Handbook of research on multicultural education (pp. 3-24). San Francisco: Jossey-Bass.

Baron, R., Tom, D., & Cooper, M. (1985). Social class, race, and teacher expectations. In J. B. Dusek (Ed.), Teacher expectancies (pp. 251-269). Hillsdale, NJ: Lawrence Erlbaum.

Bishop, J., & Mane, F. (2001). The impacts of minimum competency exam graduation requirements on college attendance and early labor market success of disadvantaged students. In G. Orfield & M. L. Kornhaber (Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in public education (pp. 51-84). New York: Century Foundation.

Black, P., & Wiliam, D. (1998). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80(2), 139-148.


Blank, R. K., Porter, A., & Smithson, J. (2001). New tools for analyzing teaching, curriculum and standards in mathematics & science: Results from the survey of enacted curriculum project. Washington, DC: Council of Chief State School Officers.

Business-Higher Education Forum. (2002). Investing in people: Developing all of America's talent on campus and in the workplace. Retrieved from http://www.acenet.edu/bookstore/pdf/investing_in_people.pdf

Callahan, R. E. (1962). Education and the cult of efficiency. Chicago: University of Chicago Press.

Chase, W. G., & Simon, H. A. (1973). The mind's eye in chess. In W. G. Chase (Ed.), Visual information processing (pp. 215-281). New York: Academic Press.

Chi, M. T. H., Glaser, R., & Farr, M. J. (Eds.). (1988). The nature of expertise. Hillsdale, NJ: Lawrence Erlbaum.

Cizek, G. (1998). Putting standardized tests to the test. Fordham Report, 2(11), 1-51.

Cook, P. J., & Ludwig, J. (1998). The burden of "acting White": Do Black adolescents disparage academic achievement? In C. Jencks & M. Phillips (Eds.), The Black-White test score gap (pp. 375-400). Washington, DC: Brookings Institution.

Council of Chief State School Officers. (2002). Annual survey of state student assessment programs: Summary report, fall 2002. Washington, DC: Author.

Cross, W. E. (1995). Oppositional identity and African-American youth: Issues and prospects. In W. D. Hawley & A. W. Jackson (Eds.), Toward a common destiny: Improving race and ethnic relations in America (pp. 185-204). San Francisco: Jossey-Bass.

Darling-Hammond, L. (2001). Inequality and access to knowledge. In J. A. Banks & C. A. M. Banks (Eds.), Handbook of research on multicultural education (pp. 465-483). San Francisco: Jossey-Bass.

Darling-Hammond, L., Ancess, J., & Falk, B. (2001). Authentic assessment in action: Studies of schools and students at work. New York: Teachers College Press.

Darling-Hammond, L., & Sykes, G. (2003). Wanted: A national teacher supply policy for education: The right way to meet the "Highly Qualified Teacher" challenge. Education Policy Analysis Archives, 11(33). Retrieved from http://epaa.asu.edu/epaa/v11n33/

Debra P. v. Turlington, 474 F. Supp. 244 (M.D. Fla. 1979).

Delpit, L. (1995). Other people's children. New York: New Press.

Dorn, S. (1998). The political legacy of school accountability systems. Education Policy Analysis Archives, 6(1). Available from http://epaa.asu.edu/epaa/v6n1.html

Eisner, E. W. (1999). The uses and limits of performance assessment. Phi Delta Kappan, 80(9), 658-660.

Elmore, R. F. (2002). Unwarranted intrusion. Education Next, 2(1). Available from http://www.educationnext.org/20021/30.html

Elmore, R. F., Fuhrman, S., & Abelmann, C. H. (1996). The new accountability in state education reform: From process to performance. In H. Ladd (Ed.), Holding schools accountable: Performance-based reform in education. Washington, DC: Brookings Institution.

Elmore, R. F., & Rothman, R. (Eds.). (1998). Testing, teaching, and learning: A guide for states and school districts. Washington, DC: National Academy Press, National Research Council, Committee on Title I Testing and Assessment.

Ferguson, R. (1998). Teachers' perceptions and expectations and the Black-White test score gap. In C. Jencks & M. Phillips (Eds.), The Black-White test score gap (pp. 318-374). Washington, DC: Brookings Institution.

Ferguson, R. (2002). What doesn't meet the eye: Understanding and addressing racial disparities in high-achieving suburban schools. Retrieved December 20, 2002, from http://www.ncrel.org/gap/ferg/


Finn, J., Gerber, S., Achilles, C. M., & Boyd-Zaharias, J. (2001). The enduring effects of small classes. Teachers College Record, 103(2), 145-183.

Firestone, W. A., Camilli, D., Yurecko, M., Monfils, L., & Mayrowetz, D. (2000). State standards, socio-fiscal context and opportunity to learn in New Jersey. Education Policy Analysis Archives, 8(35). Retrieved from http://epaa.asu.edu/epaa/v8n35/

Fordham, S., & Ogbu, J. (1986). Black students' school success: Coping with the burden of "acting White." Urban Review, 18(3), 176-206.

Gardner, H. (1992). The unschooled mind: How children think and how schools should teach. New York: Basic Books.

Gardner, H. (1999). The disciplined mind. New York: Basic Books.

Garrison, L. (1989). Programming for the gifted American Indian student. In C. J. Maker & S. Schiever (Eds.), Critical issues in gifted education: Defensible programs for cultural and ethnic minorities (pp. 116-127). Austin, TX: Pro-Ed.

Gaudet, R. (2000). Effective school districts in Massachusetts. Available from http://www.donahue.umassp.edu/publications/donapub.htm

Gerstner, L. (2002, March 14). The tests we know we need. The New York Times, p. A31.

GI Forum, Image de Tejas v. Texas Education Agency, 87 F. Supp. 2d 667 (W.D. Tex. 2000).

Goals 2000: Educate America Act of 1994. Retrieved from http://www.ed.gov/legislation/GOALS2000/TheAct/

Gong, B., & Hill, R. (2001, March). Some consideration of multiple measures in assessment and school accountability. Paper presented at the seminar Using Multiple Measures and Indicators to Judge Schools' Adequate Yearly Progress under Title I, Washington, DC.

Grissmer, D. W., Flanagan, A., Kawata, J., & Williamson, S. (2000). Improving student achievement: What state NAEP scores tell us. Santa Monica, CA: RAND.

Haertel, E. (1999). Performance assessment and education reform. Phi Delta Kappan, 80(9), 662-666.

Hambleton, R. K., Brennan, R. L., Brown, W., Dodd, B., Forsyth, R. A., Mehrens, W. A., et al. (2000). A response to "setting reasonable and useful performance standards" in the National Academy of Sciences' "Grading the Nation's Report Card." Educational Measurement: Issues and Practice, 19(2), 5-14.

Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis Archives, 8(41). Retrieved from http://epaa.asu.edu/epaa/v8n41/

Haney, W. (2001, January). Revisiting the myth of the Texas miracle in education: Lessons about dropout research and dropout prevention. Initial estimates using the common core of data. Paper presented at the Conference on Dropouts in America, Harvard Graduate School of Education, Cambridge, MA.

Haney, W. (2002). Lake Woebeguaranteed: Misuse of test scores in Massachusetts, Part I. Education Policy Analysis Archives, 10(24). Retrieved from http://epaa.asu.edu/epaa/v10n24/

Hauser, R. (2001). Should we end social promotion? Truth and consequences. In G. Orfield & M. L. Kornhaber (Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in public education (pp. 151-178). New York: Century Foundation.

Heubert, J. P., & Hauser, R. M. (Eds.). (1999). High stakes: Testing for tracking, promotion, and graduation. Washington, DC: National Academy Press, National Research Council, Committee on Appropriate Test Use.

Hillocks, G. (2002). The testing trap: How state writing assessments control learning. New York: Teachers College Press.

Holmes, C. T. (1989). Grade level retention effects: A meta-analysis of research studies. In L. Shepard & M. L. Smith (Eds.), Flunking grades: Research and policies on retention (pp. 16-33). London: Falmer.


Howard, J., & Hammond, R. (1985). Rumors of inferiority: The hidden obstacles to Black success. The New Republic, 3686, 17-21.

Jencks, C., & Phillips, M. (1998). The Black-White test score gap: An introduction. In C. Jencks & M. Phillips (Eds.), The Black-White test score gap (pp. 1-51). Washington, DC: Brookings Institution.

Kane, T., & Staiger, D. O. (2002). Volatility in school test scores: Implications for test-based accountability systems. Education Next. Available from http://www.educationnext.org/20021/56.html

Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What do test scores in Texas tell us? Santa Monica, CA: RAND.

Koretz, D., & Barron, S. (1998). The validity of gains in scores on the Kentucky Instructional Results Information System (KIRIS). Santa Monica, CA: RAND.

Koretz, D., Klein, S., McCaffrey, D., & Stecher, D. (1994). Can portfolios assess student performance and influence instruction? Santa Monica, CA: RAND.

Kornhaber, M. L. (in press). Standards, assessment, and equity. In J. Banks & C. A. M. Banks (Eds.), Handbook of multicultural education. San Francisco: Jossey-Bass.

Kornhaber, M. L., & Orfield, G. (2001). High-stakes testing policies: Examining their assumptions and consequences. In G. Orfield & M. L. Kornhaber (Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in public education (pp. 1-18). New York: Century Foundation.

Ladson-Billings, G. (2001a). Multicultural teacher education: Research, practice, and policy. In J. A. Banks & C. A. M. Banks (Eds.), Handbook of research on multicultural education (pp. 747-759). San Francisco: Jossey-Bass.

Ladson-Billings, G. (2001b). Crossing over to Canaan: The journey of new teachers in diverse classrooms. San Francisco: Jossey-Bass.

Levin, H. M. (2001). High-stakes testing and economic productivity. In G. Orfield & M. L. Kornhaber (Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in public education (pp. 39-49). New York: Century Foundation.

Lewin, T., & Medina, J. (2003, July 31). To cut failure rate, schools shed students. The New York Times, p. A1.

Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.

Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems: Implications of requirements of the No Child Left Behind Act of 2001. Educational Researcher, 31(6), 3-16.

Madaus, G., & Clarke, M. (2001). The adverse impact of high-stakes testing on minority students: Evidence from one hundred years of test data. In G. Orfield & M. Kornhaber (Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in public education (pp. 85-106). New York: Century Foundation.

Massachusetts Department of Education. (2003). Massachusetts curriculum frameworks. Available from http://www.doe.mass.edu/frameworks/current.html

McDonnell, L. M., McLaughlin, M. J., & Morison, P. (Eds.). (1997). Educating one and all: Students with disabilities and standards-based reform. Washington, DC: National Academy Press, National Research Council, Committee on Goals 2000 and the Inclusion of Students with Disabilities.

McNeil, L. (2000). Contradictions of school reform: Educational costs of standardized testing. New York: Routledge Kegan Paul.

McNeil, L., & Valenzuela, A. (2001). The harmful impact of the TAAS system of testing in Texas: Beneath the accountability rhetoric. In G. Orfield & M. L. Kornhaber (Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in public education (pp. 127-150). New York: Century Foundation.


Meier, D. (2000). Educating a democracy. In J. Cohen & J. Rogers (Eds.), Will standards save public education? (pp. 3-31). Boston: Beacon.

Mickelson, R. A. (1990). The attitude-achievement paradox among Black adolescents. Sociology of Education, 63(1), 44-61.

Miller, G. (1956). The magical number seven: Some limits on our capacity for processing information. The Psychological Review, 63, 81-97.

Moore, D. (2000). Chicago's grade retention program fails to help retained students. Chicago: Designs for Change.

Moore, D., & Hanson, M. (2001). School system leaders propose ineffective strategies contradicted by results and research. Chicago: Designs for Change.

Murnane, R. (2000). The case for standards. In J. Cohen & J. Rogers (Eds.), Will standards save public education? (pp. 57-63). Boston: Beacon.

National Alliance of Business. (2002). About the organization. Retrieved from http://www.nab.com/about.htm

National Center for Education Statistics. (2003). The nation’s report card. Retrieved from http://www.NCES.ed.gov/nationsreportcard/mathematics/results/ and from http://www.NCES.ed.gov/nationsreportcard/reading/results/

National Commission on Excellence in Education. (1983). A nation at risk: The imperative for education reform. Washington, DC: Government Printing Office.

National Governors Association. (2001). Standards, assessments and accountability. Retrieved from http://www.nga.org/center/topics/1,1188,D_413,00.html

Natriello, G., & Pallas, A. (2001). The development and impact of high-stakes testing. In G. Orfield & M. L. Kornhaber (Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in public education (pp. 19-38). New York: Century Foundation.

Neill, M. (with Gayler, K.). (2001). Do high-stakes graduation tests improve learning outcomes? Using state-level NAEP data to evaluate the effects of mandatory graduation tests. In G. Orfield & M. L. Kornhaber (Eds.), Raising standards or raising barriers? Inequality and high-stakes testing in public education (pp. 107-125). New York: Century Foundation.

No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425 (2002). Also available from http://www.ed.gov/legislation/ESEA02

Ogbu, J. U. (1978). Minority education and caste: The American system in cross-cultural comparison. New York: Academic Press.

Ogbu, J. U. (1991). Immigrant and involuntary minorities in comparative perspective. In M. A. Gibson & J. U. Ogbu (Eds.), Minority status and schooling: A comparative study of immigrant and involuntary minorities (pp. 3-33). New York: Garland.

Orfield, G. (Ed.). (1993). Separate and unequal in the metropolis: The changing shape of the school desegregation battle. Washington, DC: Brookings Institution.

Orfield, G. (2001). Schools more separate: Consequences of a decade of resegregation. Retrieved from http://www.law.harvard.edu/groups/civilrights/publications/resegregation01/presssegexs.html

Orfield, G., & Kornhaber, M. L. (Eds.). (2001). Raising standards or raising barriers? Inequality and high-stakes testing in public education. New York: Century Foundation.

Orfield, G., & Yun, J. T. (1999). Resegregation in American schools. Cambridge, MA: Harvard Civil Rights Project. Retrieved from http://www.law.harvard.edu/groups/civilrights/publications/resegregation99/resegregation99.html

Peabody, Z. (2003, August 3). Paige's methods at HISD reassessed. Houston Chronicle. Retrieved from http://www.chron.com/cs/CDA/story.hts/nation/2024163

Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2002). Knowing what students know: The science and design of educational assessment. Washington, DC: National Academy Press.


Perkins, D. N., & Salomon, G. (1988). Are cognitive skills context-bound? Educational Researcher, 18(1), 16-25.

Pick, G. (2000). Taking the road less traveled: Authentic assessment in other locales. Retrieved from http://www.catalyst-chicago.org/9-00/0900authentic.htm

Pipho, C. (1986, May 12). Tracking the reforms, Part 12. Education Week, IV(34), 20.

Platzer, M., Novak, C., & Kazmierczak, M. (2002). Cybereducation 2002. Washington, DC: American Electronics Association.

Ravitch, D. (1996). National standards in American education: A citizen's guide. Washington, DC: Brookings Institution.

Resnick, L., & Nolan, K. (1995). Where in the world are world-class standards? Educational Leadership, 52(6). Retrieved from http://www.ascd.org/readingroom/edlead/9503/resnick.html

Salomon, G., & Perkins, D. N. (1989). Rocky roads to transfer: Rethinking mechanisms of a neglected phenomenon. Educational Psychologist, 24(2), 113-142.

Schemo, D. J. (2003, July 26). Education secretary defends school system he once led. The New York Times, p. A9.

Simmons, W., & Resnick, L. (1993). Assessment as the catalyst of school reform. Educational Leadership, February, 11-15.

Sizer, T. (1984). Horace's compromise. Boston: Houghton Mifflin.

Skrla, L., Scheurich, J. J., & Johnson, J. F. (2000). Equity-driven achievement-focused school districts: A report on systemic school success in four Texas school districts serving diverse student populations. Austin: University of Texas at Austin, The Charles A. Dana Center.

Spalding, E. (2000). Performance assessment and the new standards project: A story of serendipitous success. Phi Delta Kappan, 81(10), 758-764.

Spencer, M. B., Noll, E., Stoltzfus, J., & Harpalani, V. (2001). Identity and school adjustment: Revisiting the "acting White" assumption. Educational Psychologist, 36(1), 21-30.

Stotsky, S. (2000). What's at stake in the K-12 standards war. New York: Peter Lang.

Taylor, W. (2000, November 15). Standards, tests, and civil rights. Education Week, 20(11), 56, 40-41.

Tyack, D. (1974). The one best system. Cambridge, MA: Harvard University Press.

Valenzuela, A. (1999). Subtractive schooling: U.S.–Mexican youth and the politics of caring. Albany: State University of New York Press.

Viadero, D. (1994, July 13). Teaching to the test. Education Week, XIII(39), 21-25.

Wiggins, G. (1998). Educative assessment: Designing assessments to inform and improve student performance. San Francisco: Jossey-Bass.

Winerip, M. (2003a, June 4). In the affluent suburbs, an invisible race gap. The New York Times, p. B8.

Winerip, M. (2003b, June 18). On education: Moving quickly through history. The New York Times, p. B10.

Zessoules, R., & Gardner, H. (1991). Authentic assessment: Beyond the buzzword and into the classroom. In V. Perrone (Ed.), Assessment in schools (pp. 47-71). Washington, DC: Association for Supervision and Curriculum Development.

Mindy L. Kornhaber is an assistant professor in the Department of Education Policy Studies at the Pennsylvania State University College of Education. Her research explores how institutions and the policies surrounding them enhance or impede human development and how cognitive ability can be developed both to a high level and on an equitable basis.
