
Mapping District Level K-12 Scale Scores Onto National & International Assessment Distributions

Determining Percentiles of District Mathematics Performance Within State, National and International Testing Regimes

David V. Anderson
December 20, 2011

Abstract

The Global Report Card (GRC) of the George W. Bush Institute, developed by Jay Greene and collaborators, provides percentile measures of mathematics achievement of school districts within the United States against three populations of tested students at the state, national, and international levels. At the state level the students were tested by state operated assessment systems mandated by the No Child Left Behind legislation, while at the national level and international level, comparisons with the students who took the NAEP and PISA tests, respectively, were studied. Given the global nature of our contemporary economic situation, our competitors are both national and international. Knowing how local schools, at the district level, perform relative to high performers both at home and abroad can tell us where further improvements are needed. As Greene has been showing, districts that may seem exemplary within the framework of state testing can turn out to be only average on the larger “stage.” In the work presented here, we have developed appropriate mathematical linkages between the different testing regimes that build on the pioneering work of Greene. The three measures often employed to represent student skills are scale scores, proficiency percentages, and percentile rankings. Each has advantages for evaluating the quality of various educational settings, but here the focus is on using scale score distributions to enable estimations of percentile rankings.

Summary

Most states in the United States administer achievement examinations to children attending public schools and do so at most grade levels, including the 4th, 8th, and 10th or 11th grades. Of particular importance are the exams for reading and mathematics skills that are used to determine which children are to be deemed proficient (at or above grade level) in those subjects. The U.S. Department of Education also administers an achievement examination, the National Assessment of Educational Progress (NAEP), to determine, among other things, what percentages of children are proficient in these same two areas. The NAEP is widely regarded as the standard or benchmark against which the other exams can be “rated.” The NAEP is also known, popularly, as the Nation’s Report Card. The state based exams unfortunately tend to grossly exaggerate proficiency numbers as compared to the NAEP “standard.” For this reason preference would be given to the NAEP proficiencies if they were available locally, which they are not. However, local NAEP proficiencies can be estimated. To that end, in earlier work we developed and used methods that allow one to map state reported proficiencies onto the NAEP proficiency scale. This has enabled stakeholders in US K-12 education outcomes to have a better means to compare schools and districts across state boundaries.

An alternative approach, and the one developed by Greene’s group for the George W. Bush Institute’s Global Report Card (GRC), has been using the percentile rankings as the figure of merit rather than the proficiencies or scale scores. They also saw the need for international comparisons that would enable our educators to have a better idea of what skills our students need to possess to better compete in the global economy. Our focus here is on mathematics testing at the 8th grade level within the United States and on the mathematics testing of 15 year olds within the nations of the OECD who took the PISA examinations.


In what follows we develop three related formulations that allow one to calculate approximate scores that any given school district within a state would achieve on the NAEP and PISA tests. Once these scores are known, the cumulative scoring distributions are evaluated to determine percentile rankings within the three assessment environments. The primary equation that allows us to link the various test scoring distributions to one another is

S = m + R s

It allows one to evaluate the scale score S in terms of a scoring distribution’s mean m, its standard deviation s, and the scaling parameter R. This factor R measures how many standard deviation units the score S lies from the scoring distribution’s mean. All of the scoring distributions involved in our analysis, as was the case in Greene’s study, are assumed to be normal probability distributions.

Our work uses the districts within the state of New Hampshire (NH) as the locally based entities for which national and international comparisons are sought. New Hampshire uses the New England Common Assessment Program (NECAP) to produce average scale scores and proficiency percentages. As mentioned above, the NAEP testing within the United States, and the PISA testing within OECD member countries produce these same kinds of statistics. The three different testing regimes, as one might imagine, employ different units in the specification of the scale scores, means, and standard deviations.

A number of assumptions are made to render our analysis plausible. An essential assumption is that of ranking order, which holds that students ranked in a certain order on one examination will be ranked in that same order on the other tests for which comparisons are being made. This assumption implies, for example, that 8th grade children taking the state based tests and the NAEP will later, at the age of 15, preserve their rankings when they take the international PISA test.

Our analysis is based on the evaluation of five different scoring distributions:

1. The NECAP scoring distribution over NH students who took 8th grade math examinations.
2. The NAEP scoring distribution over NH students who took 8th grade math tests.
3. The NAEP scoring distribution over United States 8th grade students who took this same test.
4. The PISA scoring distribution over United States students of age 15 who took that test.
5. The PISA scoring distribution over OECD member country testees of age 15 who took this test.

Of these scoring distributions, we assume that some of the tested groups are the same or, where sampling is employed, that the represented groups are the same. Our assumptions and premises include:

a. Tested groups 1. and 2. are sufficiently identical that they not only have the same rankings but have scores located the same distance from their mean scores in their respective standard deviation units.

b. Though perhaps obvious, the scale scores of each district are the same with respect to the scoring distributions of 2. and 3. on the NAEP test.

c. As in (a.), we assume that tested groups 3. and 4. are sufficiently identical.

d. Lastly, as in (b.), the mapped scale score of each district is the same on the PISA test.

As the purpose and scope of this report is a presentation of methodology, we don’t present many results. The results that do appear play the limited role of illustrating the validity and uses of these calculations.


Table of Contents

INTRODUCTION
THE SCORING DISTRIBUTIONS AND THEIR RELATIONS
PLAN A: GENERAL EQUATIONS FOR MAPPING SCALE SCORES TO NAEP & PISA
PLAN B CALCULATIONS
THE METHOD OF JAY GREENE: PLAN C
COMPARING THE METHODS WITH NH DATA
TENTATIVE CONCLUSIONS
AN AREA YET TO BE EXPLORED
ENDNOTES AND REFERENCES


Introduction

The principal defect we see in contemporary K-12 student assessment systems is their pervasive exaggeration of student skills. In terms of proficiencies, states typically claim double or more the numbers of children deemed proficient in reading or mathematics than are measured by the NAEP. Paul Peterson and Frederick Hess quipped that, “Johnny can’t read … in South Carolina. But if his folks move to Texas, he’ll be reading up a storm.”i This, of course, refers to the fact that a student deemed proficient in Texas (where standards are lax) could end up in the lowest performance category, below basic, in South Carolina (where standards are higher). Even U.S. Secretary of Education Arne Duncan lamented, “I think we are lying to children and families when we tell children that they are meeting standards and, in fact, they are woefully unprepared to be successful in high school and have almost no chance of going to a good university and being successful.”ii When these and other experts use terms such as “lying,” “hot air”iii and “shenanigans”iv in describing state assessment systems, it suggests that they believe the inflation is intentional and sometimes less than honorably motivated. Similarly, when percentile rankings only include regional peers, a school or district’s relative performance may seem better than it would if more distant competitors were included.

To obtain realistic proficiencies, one might suggest the remedy of using the NAEP proficiencies, but this is not possible at the local school or district levels because of the sampling methods used by the NAEP and because of the legislation that prohibits it from local testing. We have developed a number of mapping techniques that have been employed to convert exaggerated (or inflated) state reported local proficiencies into ones consistent with the NAEP.v While these techniques are not error free, we have measured the methods’ errors to be sufficiently small for our intended applications.

While proficiency measures are important, the focus of this report is on percentile statistics that show how local districts compare with their peers nationally and internationally.

Our effort was inspired by the work of Jay Greene and collaborators in which they developed statistical linking procedures for estimating a district’s percentile rankings against peers at three geographical levels: statewide, national, and international. Every state in the United States tests K-12 children at a number of grade levels to determine a scale score and one or more proficiency measures. They do this for schools, districts, statewide, and for other aggregations as well. At the national level, the NAEP assessments, also known as the Nation’s Report Card, provide similar information. Internationally, within the OECD member states, the PISA examination provides scale scores and also measures of proficiency not directly comparable with those used in the United States. Greene’s work has been used to calculate the percentile information available (on the Internet) from the Global Report Card (GRC) of the George W. Bush Institute.

In our review of the literature we were not able to find any work in which proficiencies per se were mapped. In all of the articles found on the subject, the mapping is always applied to the scores rather than the proficiencies. The work by Jim Hull at the Center for Public Education,vi work at the National Center for Education Statistics,vii and research by Gary Phillips at the American Institutes for Researchviii are good examples of research employing score mappings. Different kinds of statistical relationships can be used to establish links between the scoring distributions of the same tested group on different examinations. Of the types of linking discussed by Phillips, our ELQ proficiency mapping methods would most closely fall into the “projection” category, where the linear regression relationship found in methods of this type takes on an integrated form, though not explicitly in our analysis.

This report makes some plausible assumptions about the scoring distributions of the NECAP, those of the NAEP, and those of the PISA. We approximate each actual scoring distribution with a normal distribution. We also ignore the portions of the tails of this distribution when they fall outside of the scoring range, but note that if that assumption becomes worrisome in the future, we could then replace the normal distribution with either a truncated normal distribution or with the beta distribution.

We also restrict our study to the testing of mathematics knowledge and do so mostly for the reason that the three examination environments will be most similar for a well-defined area of subject content such as mathematics. We might have chosen science to make this comparison, but even there the curricula differ too much across the world for that subject to provide a good basis for relating one nation’s students’ performance to that of another. Likewise, reading and/or other language arts are too subjective and too different across states and nations to be chosen.

We further restrict our study to the testing of 8th grade students on the state test and on the NAEP, while on the international level we focus on the performance of 15 year-old students. This means we conflate American children of roughly age 13 who have taken state based or NAEP examinations with an older group of American students of age 15 who have taken the PISA. These tested groups are obviously not the same, but we regard them as the same for our purposes.

Perhaps the most important assumption holds that students maintain their relative ranking across the different tests they take. Moreover, if the same student population takes two different tests, each student and each tested subgroup are assumed to achieve scores that have the same offset from the mean score when measured in standard deviation units. And lastly, where sampling is used, as in the cases of the NAEP and the PISA assessments, we assume the resulting distribution characteristics apply to the entire population from which the sample was drawn.

The order of presentation follows the order of our own studies of these methods. When, at first, we didn’t understand the GRC methodology, we developed our own approaches, which are presented as Plans A and B. As we completed our analysis it became clearer how the GRC methods might fit into our assumptions and equations. Our interpretation of that work is presented here as Plan C. If it should turn out that our representations of the GRC methods are not correct, we’ll work to remedy such difficulties later. As a result of this history, our presentation treats the methods, or as we say, “plans,” in the sequence Plan A, Plan B, and Plan C.

This report begins with a review of what we already know about these testing regimes and then attempts to introduce the method without the mathematics. In a later section we present the equations and solutions that enable the calculations of “would be” scores and percentile rankings at the national and global levels.

The Scoring Distributions And Their Relations

Our analysis begins at the state level, where districts within the state have reported their students’ performance on the state’s achievement test. While our analysis applies more generally to other subjects and grade levels, we focus on mathematics testing of either 8th grade students or, in the case of the PISA tests, 15 year-old students. We limit the calculations within our spreadsheets to those for the state of New Hampshire while pointing out that the methodology extends to districts all across the United States.


Before displaying the various scale score distributions, we show the mean scale scores and the standard deviation values of the scale score distributions. Table 1 presents the numbers reported for the NECAP, the NAEP and the PISA tests.

Type of scale score distribution        Mean          Standard deviation
State level NECAP over students         m1 = 843.1    s1 = 9.98
State level NAEP over students          m2 = 292.3    s2 = 33.6
National level NAEP over students       m3 = 281.7    s3 = 36.0
National level PISA over students       m4 = 487      s4 = 91
Global level PISA over students         m5 = 488      s5 = 97
State level NECAP over districts        m6 = m1       s6 = 3.56
State level NAEP over districts         m7 = m2       s7 = 12.0

Table 1. The various parameters shown here are the ones used in evaluating our analysis. They are mostly taken from published reports issued by the various testing authorities; values that were not reported were approximated from other data.

At the state level, 8th grade children are tested on the NECAP and some are tested on the NAEP. Figures 1 and 2 present the normal distributions that fit the reported NH 2009 test results for students in New Hampshire. The average scale scores of three of the state’s districts are also shown.

Figure 1. The probability scoring distribution from 2009 NECAP mathematics testing of New Hampshire 8th grade students is shown. The mean score and the scoring distribution’s standard deviation were 843.0 and 9.98, respectively, measured in the scale score units of the NECAP. Three districts, Bedford, Litchfield and Unity, with scale scores of 849, 842 and 831, respectively, are also shown. We chose them to represent high, medium and low performing districts among the 116 districts within the state.

The NAEP testing also measured this group of 8th grade students, but did so by means of testing a representative sample. For our purposes we assume that the NAEP state level results apply to the same tested population that took the NECAP test. In Figure 2 the blue curve shows the NAEP scoring distribution within the state along with the scores obtained by the three districts as inferred from their positioning in standard deviation units, which is assumed the same for the NECAP and NAEP testing of the same student population. The red curve shows the national NAEP scoring distribution among 8th graders.

(In our calculations, both the mean and standard deviation of the distributions over districts employ weighted terms where the weights are proportional to the number of tested students. Using weighted formulas ensures that the means over districts are the same as the means over students. However, the standard deviations are generally narrower when the distribution is over districts.)

Figure 2. Using published data from the NAEP organization, we show the scoring distributions for 8th grade mathematics testing in 2009. Blue and red, respectively, represent testing on the state (NH) and national levels. As in Figure 1, we show the NAEP scale scores of the three representative districts. For the blue distribution, but not the red one, the locations of these district scale scores are the same as in Figure 1 when measured in the standard deviation units relative to the mean (peak) scores. As is evident in the picture, the nationwide (red) distribution has a lower mean score but a higher standard deviation than the corresponding statewide (blue) distribution.

At this point one can ask: how does a given district perform in terms of the national NAEP scoring distribution? One could use a ruler on these graphs to answer that. First, determine from Figure 1 how many standard deviation units from the mean each district’s scale score lies. Then, with respect to the blue curve of Figure 2, one could measure out the same number of standard deviation units to locate the district scale scores, shown by the vertical red lines. Those scale scores can then be compared to the mean and standard deviation parameters of the nationwide (red) scoring distribution to estimate how many standard deviation units of the red distribution separate the NAEP scale scores of these districts from their (red) mean.

But we don’t need a ruler to obtain these results, because some simple algebra and arithmetic will do the job, as we shall show in the next section. Before we do that, in Figure 3 we show the scoring distributions for the PISA examinations that pertain to the tested populations within the United States and to the superset of those tested within OECD countries (of which the United States is one).


Figure 3. Here we display three scoring distributions from the PISA testing: that of the United States’ students (in blue), that of the subset of the 26 “developed” OECD countries, and that of the full set of countries tested in the 2009 PISA mathematics assessments. In analogy to what was shown in Figure 2, when measured in their respective standard deviation units, the scale scores of the districts are the same here for the blue distribution as they were for the red distribution of Figure 2, both pertaining to the tested group of United States students. The red curve shows the scale score distribution of the PISA testing over the global group of students within the OECD subset.

As we described above, one could continue a graphical analysis using these drawings to determine, first, the PISA scale scores of each district within New Hampshire (or any other state) and then, second, these scores in terms of the mean and standard deviation of the (red) OECD subset scale score distribution.

Finally, to obtain the percentile rankings, one simply integrates under the appropriate curve up to each district’s scale score. Because these frequency distribution curves are constructed as probability distributions, integrating under them yields quantities correctly normalized for percentiles. Thus integrating up to the peak (mean) captures half of the integrated “area,” which corresponds to the 50th percentile, and integrating far enough to the right encompasses all of the “area,” corresponding to the 100th percentile.
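As an illustration (ours, not part of the original analysis), this integration is just the normal cumulative distribution function. A minimal sketch in Python, assuming scipy is available and using Bedford’s NECAP figures from Figure 1 and Table 1:

    from scipy.stats import norm

    # Percentile of a score within a normal scoring distribution: integrate
    # the distribution up to the score, i.e. evaluate the CDF there.
    def percentile(score, mean, sd):
        return 100.0 * norm.cdf(score, loc=mean, scale=sd)

    # Bedford's NECAP scale score of 849 against the NH student distribution
    # (m1 = 843.1, s1 = 9.98 from Table 1) lands near the 72nd percentile.
    print(percentile(849, 843.1, 9.98))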

Plan A: General Equations For Mapping Scale Scores To NAEP & PISA

In what follows we present three versions of these mappings. In this section the mappings are characterized by using distributions that are always over students (and not districts). We sometimes call this method Plan A. The second method, which we label Plan B, sometimes uses distributions over districts at the state level, but is otherwise quite similar to Plan A. Plan B is described in a later section. Finally, we have the methodology of Jay Greene, discussed in a section farther along; we also call it Plan C. Now about Plan A:


As we mentioned in the Summary, our analysis is based on the evaluation of five different scoring distributions:

1. The NECAP scoring distribution over NH students who took 8th grade math examinations.
2. The NAEP scoring distribution over NH students who took 8th grade math tests.
3. The NAEP scoring distribution over United States 8th grade students who took this same test.
4. The PISA scoring distribution over United States students of age 15 who took that test.
5. The PISA scoring distribution over OECD member country testees of age 15 who took this test.

Of these scoring distributions, we assume that some of the tested groups are the same or, where sampling is employed, that the represented groups are the same. Our assumptions and premises include:

a. Tested groups 1. and 2. are sufficiently identical that they not only have the same rankings but have scores located the same distance from their mean scores in their respective standard deviation units.

b. Though perhaps obvious, the scale scores of each district are the same with respect to the scoring distributions of 2. and 3. on the NAEP test.

c. As in (a.), we assume that tested groups 3. and 4. are sufficiently identical.

d. Lastly, as in (b.), the mapped scale score of each district is the same on the PISA test.

Putting each of the preceding enumerated (1., 2., 3., 4., 5.) scoring distributions in algebraic form results in five similar equations:

The first equation calculates the NECAP scale score of a district. It uses the mean and standard deviation of the NECAP scale score distribution over the students (and not over the districts) in the state; here the state is New Hampshire. The R factor measures the scale score in the standard deviation units of the NECAP scoring distribution.

S1 = m1 + R1 s1   (Eq. 1)

The next equation involves the same students as the preceding equation, but expresses the scale score in terms of the NAEP scale score distribution over the students within the state.

S2 = m2 + R2 s2   (Eq. 2)

Because the tested groups for the two preceding distributions are sufficiently identical, and because we assume rankings and locations relative to the mean remain the same in standard deviation units, we interpret condition (a.) to require R2 = R1.

The equations for S2 and S3 express the NAEP scale scores of a district, but do so in terms of different NAEP tested distributions. The former is in terms of the NAEP distribution over students within a state (designated by subscript 2), while the latter is in terms of the NAEP national distribution of tested students (designated by subscript 3).

S3 = m3 + R3 s3   (Eq. 3)

Given that these scale scores are the same, condition (b.) requires S3 = S2.



As in the case for Eq. 2, here Eq. 4 addresses the same student testees as Eq. 3 but is measured in terms of the PISA scale score distribution instead of the NAEP.

S4 = m4 + R4 s4   (Eq. 4)

In analogy with the relationship of Eq. 2 to Eq. 1, here we require R4 = R3.

Finally, the expression for the PISA scale score with respect to its OECD testee distribution is

S5 = m5 + R5 s5   (Eq. 5)

Here the relation between Eqs. 4 and 5 is like that between Eqs. 2 and 3. Thus the scale scores are the same, S5 = S4, but the distributions cover different tested populations who took the PISA examinations.

In terms of knowns and unknowns, all of the means (the m’s) and all of the standard deviations (the s’s) shown in these equations are known from published (or in some cases well approximated) statistical parameters of the various examinations. They are displayed in Table 1 above. Of the S and R values, only S1 is known initially.

The calculations proceed as follows:

Solve Eq. 1 for R1; the condition R2 = R1 then allows one to use Eq. 2 to compute S2. Next, using the relation S3 = S2, we can solve Eq. 3 for R3. Moving to Eq. 4, we substitute R4 = R3 to solve for S4. At that point we are left with Eq. 5, which uses the fact that S5 = S4 to provide the numerical arguments required to evaluate R5.

At this stage, all of the distributions are specified and the scale scores for the different testing regimes determined. The percentile for a given district, on a particular combination of test and tested population, is then obtained by integrating over the scale score distribution from the minimum score up to the district’s average scale score. Such integrations generate the cumulative distributions. These are easily calculated with spreadsheet functions, such as Excel’s NORMDIST formula.
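To make this chain of substitutions concrete, the following is a minimal sketch of the Plan A calculation in Python; the function name plan_a is our own illustrative choice, the parameter values come from Table 1, and scipy’s norm.cdf stands in for Excel’s NORMDIST.

    from scipy.stats import norm

    # Table 1 parameters: mean (m) and standard deviation (sd) of each
    # scoring distribution, numbered 1-5 as in the text.
    m1, sd1 = 843.1, 9.98    # 1. NECAP over NH students
    m2, sd2 = 292.3, 33.6    # 2. NAEP over NH students
    m3, sd3 = 281.7, 36.0    # 3. NAEP over US students
    m4, sd4 = 487.0, 91.0    # 4. PISA over US students
    m5, sd5 = 488.0, 97.0    # 5. PISA over OECD students

    def plan_a(necap_score):
        """Map a district's NECAP scale score onto NAEP and PISA distributions."""
        r1 = (necap_score - m1) / sd1    # solve Eq. 1 for R1
        s_naep = m2 + r1 * sd2           # Eq. 2 with R2 = R1 (assumption a.)
        r3 = (s_naep - m3) / sd3         # solve Eq. 3 with S3 = S2 (assumption b.)
        s_pisa = m4 + r3 * sd4           # Eq. 4 with R4 = R3 (assumption c.)
        r5 = (s_pisa - m5) / sd5         # solve Eq. 5 with S5 = S4 (assumption d.)
        return {"NAEP score": s_naep, "national percentile": 100 * norm.cdf(r3),
                "PISA score": s_pisa, "global percentile": 100 * norm.cdf(r5)}

    print(plan_a(849))   # e.g. Bedford's NECAP scale score from Figure 1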

Plan B Calculations

In some testing environments, not all of the relevant parameters are published or easily obtained. Such has been the case in New Hampshire, where the standard deviation s1 of the NECAP scale scores over the students tested within the state has not been made available (or at least we couldn’t find it). In the preceding section, the value used for that parameter was based on an “intelligent guess,” which, if left standing, would open this report to well-justified criticism.

Thus, here in this section we discuss an alternative pair of steps that enable us to calculate the S2 value. Once its value is determined, the preceding section’s procedures are used to obtain the other values.

Instead of using NECAP scale score distributions over all tested students within the state, we alternatively use the scale score distribution over the districts in the state. We do this for both the NECAP testing and that of the NAEP state level testing.

But NAEP does not report district level scale scores or proficiencies! To get around this limitation we employ our previously developed techniques that allow us to calculate estimates of district level NAEP proficiencies. We also have a good idea of the relationship of mean scale scores to the corresponding NAEP proficiencies by looking at the ensemble of states for which both are known and published. Assuming this relationship is linear, we estimate the scale scores using the Excel spreadsheet function TREND. Using the definition of standard deviation, we use these approximated scale scores over districts to estimate the NAEP examination’s standard deviation over districts.
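For readers without Excel, TREND amounts to a linear least-squares fit evaluated at new points. Here is a sketch of the equivalent step in Python; the arrays below are placeholders standing in for the published state level data, not the actual values:

    import numpy as np

    # Hypothetical state level pairs: NAEP proficiency percentage and mean
    # NAEP scale score, for states where both are published (placeholders).
    prof = np.array([30.0, 35.0, 38.0, 43.0])
    score = np.array([270.0, 278.0, 283.0, 292.0])

    # Excel's TREND(known_y, known_x, new_x) is a least-squares line fit
    # evaluated at new x values; numpy.polyfit does the same job.
    slope, intercept = np.polyfit(prof, score, deg=1)

    district_prof = np.array([28.0, 41.0])     # estimated district proficiencies
    print(slope * district_prof + intercept)   # approximate district NAEP scores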

To work with probability distributions over districts rather than students, we need to show how we can “harmonize” their use with the distributions over students.

The average scale score over the districts in a state is not a simple average. Rather it is a weighted average in which the weights are proportional to the number of students in each district. Thus

m = [ ∑_{k=1}^{K} n_k S_k ] / [ ∑_{k=1}^{K} n_k ]

gives the mean of the K districts’ scoring distribution. And

s = √( ∑_{k=1}^{K} n_k (S_k − m)² / ∑_{k=1}^{K} n_k )

gives the appropriate weighted standard deviation. Without proving it here, we note that these weighted formulas preserve the mean value m (the same as over the tested students within the state). You can see this in Table 1. We also know that the distributions over districts generally have a significantly smaller standard deviation than those over students, as is also evident in Table 1.
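These weighted formulas translate directly into code. A small Python sketch (the three scores are the representative districts from Figure 1, while the student counts are hypothetical):

    import numpy as np

    scores = np.array([849.0, 842.0, 831.0])   # e.g. Bedford, Litchfield, Unity
    counts = np.array([550, 210, 30])          # hypothetical tested-student counts

    # Weighted mean over districts; preserves the mean over students.
    m = np.average(scores, weights=counts)

    # Weighted standard deviation over districts, per the formula above.
    s = np.sqrt(np.average((scores - m) ** 2, weights=counts))

    print(m, s)   # the district level spread is narrower than the student one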

We work with two scale score distributions related to the ones depicted above (the blue curves in Figures 1 and 2) but here they are over districts rather than students. Figure 4 shows the NECAP distribution over districts within New Hampshire.

Figure 4. Plan B starts with scale score distributions over district scores (rather than student scores). This is a narrower distribution with a smaller standard deviation (s6 = 3.56 score units versus s1 = 9.98). Individual district scale scores can be defined in terms of these distributions over districts.

To find the corresponding scale scores for the NAEP requires knowing the NAEP scale score distribution over the districts in the state. The relevant distribution for New Hampshire is shown next in Figure 5.

Figure 5. Obtaining the NAEP scale score distribution over districts required us to estimate each district’s NAEP scale score, because these are neither measured nor reported by the NAEP organization. Based on the author’s techniques for estimating NAEP parameters at the district level, we were able to calculate approximate scale scores for each district. The plot shown here is the result of fitting that distribution to the normal curve.

To make progress along this path, Eqs. 1 and 2 above must be replaced by formulas that give the scale scores in terms of the normal distributions over districts (rather than over students). These new distributions have the same mean values as before but different and smaller standard deviations. This becomes evident in the new subscripts on the R and s terms, while the mean terms retain the old subscripts. Thus the equation for the NECAP scale score of any given district is given by

S1 = m1 + R6 s6   (Eq. 6)

and the corresponding scale score on the NAEP scale is expressed as

S2 = m2 + R7 s7   (Eq. 7)

Using the same line of argument that connected the R values between Eqs. 1 and 2, we similarly contend that R6 = R7, because the ranking of a district among the other districts should remain the same as we move from the NECAP to the NAEP. The process here solves Eq. 6 for R6, or equivalently R7. This allows one to calculate S2 from Eq. 7. Once this is done, the previous equations, Eqs. 3, 4, and 5, can be used as before to calculate the other unknown parameters. And likewise the cumulative distributions over students (NAEP within the state, NAEP within the United States, and PISA more globally) can be evaluated with NORMDIST to provide the percentile rankings. We should note that the distributions over districts play a limited role through Eqs. 6 and 7, which allow us to calculate S2; we don’t use their cumulative distributions for the estimation of percentiles.
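A sketch of this Plan B entry step in Python (plan_b_entry is our own illustrative name; the district level parameters come from Table 1, and the remainder of the chain proceeds exactly as in the Plan A sketch):

    # District level parameters from Table 1 (Plan B).
    m6, sd6 = 843.1, 3.56    # NECAP over NH districts (m6 = m1)
    m7, sd7 = 292.3, 12.0    # NAEP over NH districts (m7 = m2)

    def plan_b_entry(necap_score):
        """Compute a district's NAEP scale score S2 via Eqs. 6 and 7."""
        r6 = (necap_score - m6) / sd6    # solve Eq. 6 for R6
        return m7 + r6 * sd7             # Eq. 7 with R7 = R6

    print(plan_b_entry(849))   # then Eqs. 3-5 are applied as in Plan A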

The Method of Jay Greene: Plan C

A pioneer of these types of calculations is Jay Greene. He uses the normal distribution representations of student scoring distributions on state, national, and international tests to link the scale score average of any given school district within the United States to produce a percentile ranking against students at these three levels.ix Currently, the George W. Bush Institute’s Global Report Card uses this method.x We next attempt to replicate the formulation of that method.

Beyond the assumptions we made in our analysis, the GRC method assumes:

s3 = s2   (Eq. 8)

and

s5 = s4   (Eq. 9)

These assumptions state that the standard deviations of the NAEP test are the same for the statewide and nationwide testing, and that those of the PISA are the same for the United States testees as for the larger tested superset of students globally within the OECD countries.

The relative errors made by these assumptions can be calculated from the values in Table 1: about 7% for Eq. 8 (36.0 versus 33.6) and about 6% for Eq. 9 (97 versus 91).

Substituting these two equations into Eqs. 2–5 leads to simplifications enabling the direct calculation of R3 and R5. Their formulas become

R3 = R2 + (m2 − m3)/s2   (Eq. 10)

and

R5 = R2 + (m2 − m3)/s2 + (m4 − m5)/s4   (Eq. 11)

We believe these two formulas are consistent with the methods used in the GRC estimations.
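Under our reading of the GRC method, the Plan C shortcut can be sketched as follows (again illustrative, reusing the Table 1 values; norm.cdf supplies the percentile integration):

    from scipy.stats import norm

    m1, sd1 = 843.1, 9.98    # NECAP over NH students
    m2, sd2 = 292.3, 33.6    # NAEP over NH students
    m3, m4, m5 = 281.7, 487.0, 488.0
    sd4 = 91.0

    def plan_c(necap_score):
        """National and global percentiles via Eqs. 10 and 11 (s3 = s2, s5 = s4)."""
        r2 = (necap_score - m1) / sd1    # R2 = R1, from Eqs. 1 and 2
        r3 = r2 + (m2 - m3) / sd2        # Eq. 10
        r5 = r3 + (m4 - m5) / sd4        # Eq. 11
        return 100 * norm.cdf(r3), 100 * norm.cdf(r5)

    print(plan_c(849))   # (national percentile, global percentile)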

Comparing The Methods With NH Data

We have evaluated the foregoing formulas and equations with respect to the three methods we discussed:

Plan A: Scale scores and associated percentiles are calculated for the NECAP, NAEP and PISA testing environments by linking five distributions, all over students.

Plan B: The same as Plan A, except the NECAP and NAEP scoring formulas at the state level employ distributions over school districts rather than over students.

Plan C: This is Jay Greene’s method, which assumes that each testing environment, NAEP or PISA, has the same standard deviations in the following sense: the NAEP standard deviation for the state level tested population is the same as for the national level population, and the PISA standard deviation for the United States tested population is the same as that of the more global OECD tested population.

In Table 2 we show some of the resulting scale scores and associated percentiles for the different tests and “Plans.”

Table 2. Scale scores and percentiles, unless known from published reports, are shown for the different Plans as they apply to the different tests and tested groups. All three plans agree for the estimations at the state level, as would be expected from the assumptions. At the national and international levels, significant differences exist between Plan C and the others. When all New Hampshire districts are included, the root-mean-square error of the percentiles is 16 percentile points at the national level and 17 at the global level.

Tentative Conclusions

Stakeholders in K-12 education often live in a “foggy province” where they can’t see how their local schools perform against realistic standards or against their peers. Two kinds of standards of student performance, proficiency criteria and/or percentile rankings, are difficult to measure when different political jurisdictions use incompatible assessment systems. In the case of the NAEP, the political “class” in Washington feared the application of the highly regarded NAEP at the local level and actually prohibited such testing at the school and district levels. This same kind of fear, in our opinion, led policy makers at the state level to develop testing systems that, in most cases, grossly exaggerate student progress. No one wanted to be embarrassed by the poor performance measurements that were likely to ensue in a rigorous testing environment. So, until recently, stakeholders did not have readily available statistics that would provide them with a realistic accounting of student outcomes within our systems of K-12 education.

At least two recent efforts have begun to remove the fog by making estimates, based on various statistical linking methods, of student performance at the local levels:

The most extensive of these undertakings is that of the Global Report Card, which has been using methods of Jay Greene to compare district level performance to that of students at the three geographical levels of the state, of the United States, and of the OECD nations. That approach, of course, has been the subject of this report.

A related and complementary approach has been developed by this author. We estimate local NAEP proficiency percentages at both school and district levels. To date, however, we have not been able to complete these estimates for the entire United States; only 16 states have been evaluated so far. A guidebook to the public schools of Maryland, Virginia, and Washington D.C., using our methods, was published earlier this year. And now we are completing a similar guidebook for public schools in the ten Northeastern States (from Delaware to Maine).

As this report focuses on the GRC method (Plan C) and the related ones we developed (Plans A & B), we restrict the remainder of our comments to them.

We showed that Plan C is a special case of Plan A under the circumstance that certain standard deviation parameters are the same. When the widths (standard deviations) of the two PISA tested populations (United States testees and OECD testees) are the same, and when the widths of the two NAEP tested populations (statewide testees and United States testees) are the same, then Plan A becomes Plan C.

From our analysis, we think that the GRC might use a method like Plan A, which does not depend on equal standard deviation assumptions. Because our work here is preliminary, we think further corroboration and testing should be undertaken to verify our “tentative” conclusions.

We are also encouraged that the GRC information is complementary (in the mathematical sense) to the author’s NAEP proficiency estimates. The GRC data bank could be expanded to include statistics that would provide NAEP proficiency percentage estimates for students at the local levels as well as in various OECD member states. Such an expansion would have its benefits:

The percentile rankings provide a measure of relative performance among a district’s peers.

The estimated proficiencies provide a measure of absolute performance against knowledge content standards.

Having both would provide stakeholders a better perspective from which local schools and districts could be judged.

An Area Yet To Be Explored

As we were reviewing the analysis of the methods used in this report, we realized that the district level information provided by the state testing authorities also allows one to calculate a normal distribution for the tested students within each district. One use of such information would be to check the assumption that the statewide student scoring distribution (NECAP) is normal, or whether it is skewed. Such information, for example, might enable one to estimate similar district based distributions of NAEP scores. From them, NAEP proficiency estimates could be calculated independently of this author’s other work on estimated NAEP proficiencies. Other evidence we have seen suggests that the scoring distributions may well be skewed by the practice of special accommodations in testing; here the lower tail would be shrunk (or moved up) due to the testing environment advantages extended to students with certain kinds of physical, mental, and/or economic disadvantages.


Endnotes and References


i Paul Peterson and Frederick Hess, Johnny Can Read… in Some States: Assessing the Rigor of State Assessment Systems, (Education Next, Summer 2005).

ii Arne Duncan, quoted in US News & World Report, February 9, 2009.

iii Kevin Carey, “Hot Air: How States Inflate Their Educational Progress Under NCLB,” (Education Sector, May 2006).

iv Tom Loveless, “Are States Honestly Reporting Test Scores?” in The 2006 Brown Center Report on American Education (The Brookings Institution, 2006), p. 22.

v David V. Anderson, Generating Local NAEP Proficiency Estimates By The Ellipse-Quartic (ELQ) Mapping Methods, Asora Education Enterprises. This unpublished report, ELQ-Mappings.docx, and its associated spreadsheets, (ELQ-Derivation.xlsx), can be accessed and downloaded from: http://asoraeducation.com/page35/page40/page40.html.

vi Jim Hull, Mapping state cut scores against NAEP: The proficiency debate, Center for Public Education website at http://www.centerforpubliceducation.org/site/c.kjJXJ5MPIwE/b.4203833/k.89C6/Mapping_state_cut_scores_against_NAEP_The_proficiency_debate.htm.

vii The National Center for Education Statistics report, Mapping 2005 State Proficiency Standards Onto the NAEP Scales, June 2007, provides another example of mapping scores from state administered assessments to the NAEP scoring scale. The report is available at http://nces.ed.gov/nationsreportcard/pubs/studies/2007482.asp.

viii Gary W. Phillips, Expressing International Educational Achievement in Terms of U.S. Performance Standards: Linking NAEP Achievement Levels to TIMSS, American Institutes for Research, April 24, 2007. This report downloads from http://www.air.org/news/documents/naep-timss.pdf.

ix Jay P. Greene and Josh B. McGee, When the Best is Mediocre, (Education Next, Winter 2012)

x The George W. Bush Institute’s Global Report Card (GRC) school performance measures can be found at http://www.globalreportcard.org/map.html. Nearly every school district in the United States is represented in the GRC data bank from which stakeholders and others can look up percentile performance against state level, national level, and international level groups of tested students.