
INTERNATIONAL JOURNAL OF TESTING, 1(3&4), 319-326
Copyright © 2001, Lawrence Erlbaum Associates, Inc.


Applying the Rasch Model: Fundamental Measurement in the Human Sciences. Trevor G. Bond and Christine M. Fox (Eds.), Mahwah, NJ: Lawrence Erlbaum Associates, Inc., 2001, 255 pages, $59.95, hardback; $29.95, paperback.

Reviewed by Wim J. van der Linden
Department of Educational Measurement and Data Analysis

University of Twente

Over the last 4 decades or so, test theorists have done exemplary work expanding measurement theory, but they have largely ignored their obligation to explain their progress to a larger public. We do not have many books that introduce modern measurement to test users in a technically undemanding way. So when the Books Editor of this journal asked me to review Applying the Rasch Model: Fundamental Measurement in the Human Sciences, which in its preface states its mission as explaining the fundaments of measurement to substantive researchers, I nourished high hopes of the text I was going to read. Having finished the book, I cannot say the authors have redeemed my expectations. On the contrary, I fear that the readers for whom this book is written may have serious difficulty understanding what modern measurement is. Or, worse still, that, if this is the only text on measurement they ever read, they may walk about with serious misconceptions for the rest of their lives. However, let me first summarize the contents of the book.

The authors have chosen to follow the didactic principle of learning by examples. No theory is presented for its own sake. As a consequence, most of the chapters in the book have the character of a series of worked empirical examples, complete with tables with data sets and graphs and tables presenting computer output. This is a respectable approach; nothing is as convincing as an empirical example that shows step by step how a measurement model is applied, its fit is tested, how the scale values of the items and the measurements of the examinees are interpreted, and how these interpretations contribute to our understanding of substantive problems. The examples are taken from several domains, but most are from psychological testing, with an emphasis on applications in developmental psychology.

Requests for reprints should be sent to Wim J. van der Linden, Department of Educational Measurement and Data Analysis, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands. E-mail: [email protected]

The first two chapters discuss the nature of measurement in the human sciences. Chapter 1, "Why Measurement is Fundamental" (note the difference in use of "fundamental" between the title of this chapter and the book), briefly introduces S. S. Stevens's definition of measurement and then shows how much effort the natural sciences have put in the definition of measurement scales relative to the poor interest in the human sciences. The authors could not be more right; anyone with an active practice in methodological consultation knows examples of substantive research seriously hampered by inconsiderate measurement. The authors then suddenly claim the following:

In the human sciences, we currently have only one readily accessible tool to help us construct objective, additive scales: the Rasch model. This model can help transform raw data from the human sciences into abstract, equal-interval scales. Equality of intervals is achieved through log transformations of raw data odds, and abstraction is accomplished through probabilistic equations. (p. 7)

These sentences reveal a style of writing that appears to be dominant throughout the rest of the book. Strong claims of unique properties are put forward without any proof or substantiation. Statements about probabilities and statistical principles are more often incorrect or meaningless than correct or meaningful. For example, in the aforementioned statement, the Rasch model is presented as a transformation of raw data instead of a probabilistic model for distributions of item responses. Expressions such as "raw data odds" and "abstraction is accomplished through probabilistic equations" are without any sense (p. 7). An example of a plainly wrong statement is on the next page where it is asserted that all current methods for measurement are based on "logistic distributions" (p. 7). Almost any following page has such claims or incorrect statements. We return to this point at the end of our review.
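To see the contrast between a data transformation and a measurement model, it may help to state the Rasch model in its usual form (standard notation, not taken from the book under review): it specifies the probability of a correct response of examinee n to item i as a function of a person parameter and an item parameter,

\[
  \Pr\{X_{ni}=1 \mid \theta_n, b_i\} \;=\; \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
  \qquad
  \ln\frac{\Pr\{X_{ni}=1\}}{\Pr\{X_{ni}=0\}} \;=\; \theta_n - b_i,
\]

with theta_n the ability of the examinee and b_i the difficulty of the item. It is the log odds of these model probabilities, not of the raw data, that is linear in the parameters; the model is a statement about the distribution of item responses, which is exactly what the quoted passage obscures.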

The discussion of measurement principles is continued in chapter 2. The chapter begins with the presentation of a matrix with the responses of a set of examinees on a set of items, and then presents an informal Guttman analysis, whereupon the authors explain that the Guttman model is too rigid to deal with inconsistencies in actual response patterns. They then calculate item difficulties and person abilities as fractions of items correct, and next try making us believe that all the Rasch model amounts to is a natural logarithmic transformation of these fractions to get a "linear scale" (p. 17). They even claim that examples of "such mathematical transformations to produce linear measures abound in the physical sciences" (p. 17), and continue to state that such scales are necessary to avoid "the problem of bias toward scores in the middle of the scale, and against persons who score at the extremes" (p. 17). They do not even hint at the probabilistic nature of the modern measurement models. Readers who have not forgotten all of their high school mathematics may remember that the logarithmic transformation is monotone and does not change the order of the rows and columns in a Guttman matrix. They will wonder how this transformation solves the problem of inconsistencies in the response patterns that was used to motivate the choice for the Rasch model.
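A one-line check (my notation, not the book's) shows why: if p_i is the proportion of examinees answering item i correctly, the transformation the authors describe amounts to the logit

\[
  d_i \;=\; \ln\frac{1 - p_i}{p_i},
\]

with the analogous expression ln[p_n/(1-p_n)] for persons. This is a strictly monotone function of the proportion and therefore carries no information beyond it: it merely relabels the items and persons on a new scale and leaves the order of the rows and columns of the Guttman matrix, including the inconsistent response patterns within it, untouched.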

In chapters 3 and 4 the use of the Rasch model is illustrated by the example of a fictitious developmental psychological pathway of a child and an empirical data set from the Bond Logical Operations Test for the cognitive development of adolescents. Key in the authors' exposition is a map of the items and examinees on a line, along with a confidence band about the line to check the fit among the items, examinees, and the model. This representation is a powerful instrument to show how measurement works. However, contrary to the authors' claim, the representation is not unique to the Rasch model; any scaling method for test responses, psychophysical data, paired comparisons, or the unfolding of ratings is based on the same principle of mapping items and examinees on a common line. Difficulty and ability estimation is again presented as "a logarithmic transformation on the item and person data" (p. 29), a statement that now easily creates the definitive misunderstanding that these estimates are even the result of separate transformations on the row and column margins of the data matrix. Another misunderstanding arises because the authors fail to explain the difference between estimation errors and estimates of standard errors of estimation. On p. 29 they indicate that the size of the symbols for the ability and difficulty estimates along the scales represents their error. The same confusion is present in all graphs, tables, and comments later in the book. If these are really errors, why not subtract them from the ability and difficulty estimates to get true values? An informative section is the one in chapter 4 where the person and item distributions are compared to show when a test is targeted at the right level and when it is not. In the same chapter, the authors rather casually introduce Wright's infit and outfit statistics and postpone a more systematic treatment of the issue of model fit to chapter 12.
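The distinction matters. In standard IRT notation (not given in the book), the estimation error of an ability estimate, the difference between the estimate and the true parameter, is unknown and cannot be removed; what can be reported alongside a Rasch ability estimate is its asymptotic standard error,

\[
  \mathrm{SE}(\hat{\theta}_n) \;\approx\; \Bigl[\,\sum_{i} P_{ni}\,(1 - P_{ni})\Bigr]^{-1/2},
\]

the reciprocal square root of the test information evaluated at the estimate, with an analogous expression for the item difficulty estimates. Subtracting such a quantity from an estimate obviously does not produce a true value.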

The problem of how to check if two tests measure the same ability is introduced in chapter 5. This question is also consistently (and erroneously) referred to as the problem of "equating two tests." From an item response theory (IRT) point of view, the obvious approach would be to calibrate the two tests concurrently and check the fit of the model. However, the authors prefer to calibrate the two tests separately, plot the ability estimates against each other, and check if 95% of the points fall in the confidence band (calculated under the assumption of independence). This method does not give any information on the presence of a common ability variable. Any two variables individually fitting the Rasch model treated in this way would yield a plot with the required proportion of points within the confidence band. On p. 57, it is wrongly implied that tests measuring a common variable can only have the same mean ability for a group of examinees if they are equally difficult. On p. 59, the aforementioned "check" on a common underlying ability variable is even interpreted as a check to show if two tests are parallel. These examples are only a few of the technical errors and misconceptions that abound in this chapter.

The best chapters in the book are chapters 6, 7, and 11, which discuss the rating scale and partial credit models. The authors' criticism of naïve analyses of Likert-type scale data and their plea for a more formal approach based on a response model is quite convincing. Chapter 11 goes deeper into the problem of diagnosing rating scale items. One of the principles introduced here is that the distribution of the responses over the item categories in the calibration sample should be uniform to get satisfactory estimates. The authors go so far as to blindly recommend collapsing categories if the distributions are skewed. The validity of the principle is not shown in any way. It should not be trusted. For example, it is nowhere noted that collapsing categories may seriously bias the estimates of the category parameters (and even cause the model to misfit). Also, as follows from optimal sampling design theory in statistics, the optimal sample is not a function of the data but of the true parameters in the model (including those for the abilities of the examinees). And why collapse categories if we have enough data to get satisfactory estimates but the distribution is not uniform?

Chapter 8, "Measuring Facets Beyond Ability and Difficulty," introduces themany-facets Rasch model. The main application discussed is one with data on rat-erg judging the performances of examinees on tasks differing in difficulty. Themodel itselfis not explained in any war, and no attempt is made to discuss its ap-propriateness for this type of data. It is ncw widely known that a model for rater ef-fects can only be correct if the model bas a multilevel structure to allow for the factthat the raters do not interact directly with the abilities of the examinees but onlythrough their observed performances. An introductory text to model-based mea-surement should focus on the assumptions underlying the models and discuss theirapplicability to the problem at hand. The authors simply ignore this issue and failto notice that the many- facets Rasch model does not have the right structure to ana-

lyze performance ratings.The interest of the authors in developmental problems no doubt bas motivated

them to include chapter 9 on Wilson's saltus model for developmental discontinu-ities. The introduction to the model is most confusing, however. I had to go back tothe originalliterature to remind myself of the fact that one of its key features is thepresence of (unknown) a membership parameter to denote the developmental stageof membership the examinee is in as weIl as parameters to represent possiblechanges in difficulty between classes of items for different stages. The only placewhere the authors allude to these parameters is when they explain that Table 9.4contains a column with "the probabilities of intuitive group membership" (p. 130).Further, many a reader for which this chapter is the first introduction to the modelwill immediately note that the equations for the gaps between the mean abilities for

Page 5: INTERNATIONAL JOURNAL OF TESTING, 1(3&4), 319 … 2001... · 320 VAN DERLINDEN of the items and the measurements of the examinees are interpreted, and how these interpretations contribute

323BaOK REVIEWS

the two classes of items (pp. 126-127) yield the same result and that the asymme-try index seems to be equal to zero by definition. They will wonder why we needthis model if there is never a developmental discontinuity.

Chapter 10 offers various applications of the Rasch model across the human sciences. The examples include an analysis of checklists for public policy involvement of nutritional professionals, school opinion surveys among parents, data from medical settings, and sporting performances. The chapter also contains a short section explaining the mechanics of computerized adaptive testing.

Chapter 12, "The Question of Model Fit," beging with a discussion ofthe ideaof residual analysis, then gives a brief verbal description of Wright's infit andoutfit statistics, and finishes with the discussion of a few more fundamental is-sues. I doubt if the reader wil I ever understand the idea of model fit and why it isnecessary to use statistical tests to check its goodness. Again, we meet severalstatements and, expressions that are incorrect or meaningless. For example, on p.176 it is stated that expected residuals can never be equal to zero. On the samepage, the authors assert that, despite different weighing procedures, the infit andoutfit statistics have exactly the same distribution. However, it is impossible forthese statistics to have identical sampling distributions for any sample size. Onthe next page they state that "infit and outfit statistics are reported in various in-terval farms ...in which their expected value is 0" (p. 177), whereas on p. 179they refer to "idiosyncratic mean-square distributions." It is unknown to me what"various interval forms" or "idiosyncratic distributions" are. A more fundamen-tal problem is that the authors never practice the philosophy of model fit analysisthey preach throughout tros hook. If the model and the items do not fit together,their suggestion is to go back to the items, find out what may be wrong, refor-mulate the items, and repeat the fit analysis. This is exactly how we should oper-ate. However, the empirical examples presented in the hook have several itemsthat do not fit the model and for none of them there is a report on how they wererewritten and reanalyzed until they demonstrated fit.

Chapter 13 gives a synthetic overview of the field. It discusses such issues as the relation between the Rasch model and conjoint measurement, construct validity, and multidimensionality. The treatment of each of these issues is too brief for the intended readers. How many of them would be able to formulate what conjoint measurement is when they have finished this chapter?

The two appendixes of this book explain the technical aspects of the models in this book (Appendix A) and discuss resources on Rasch measurement (e.g., software, texts, Web sites, professional organizations; Appendix B). The treatment of the technical aspects is incomprehensible. The introduction to the Rasch model begins with the sentence, "Once we have estimated ability (Bn) and difficulty (Di), the probability of correctly answering an item can be expressed mathematically as the general statement: Pni(X = 1) = f(Bn - Di)" (p. 200). We first need to estimate parameters to formulate a model? What parameters? The next equation shows the same expression but now with parameters. The third equation, again with estimates instead of parameters, is the Rasch model, which is presented as an "expansion of equation 1 to demonstrate that the function (f) expressing the probability of a successful response consists of the natural logarithmic transformation of the person (Bn) and item (Di) estimates" (p. 201). The rating scale model is first shown with a parameter for category 1. The equation is interpreted as the "probability of the person choosing category 1 over category 0" (p. 203). Then the equation is repeated for category 2, with an analogous interpretation. Next, the same equation is presented for arbitrary category parameters, but now the probability is interpreted as the probability of choosing any category. All equations have estimates instead of parameters, except equation 9, where the authors suddenly move back to parameters. The treatment of the fit statistics used in all of the examples in the book is also too brief. The equations of the infit and outfit statistics are presented, and as for their null distribution it is observed that, "When these infit and outfit values are distributed as mean squares, their expected value is 1" (p. 208). It is left to the reader to guess to what cases the authors refer when they say "When" and what the distribution is (none of my textbooks on probability and statistics has a "mean squares distribution"). Appendix B is alarmingly narrow-minded in its references to resources on the Rasch model. The authors refer only to work done at the University of Chicago and in Australia. The rest of the world, including useful results on the Rasch model produced by groups of researchers in Austria, Belgium, Denmark, Germany, the Netherlands, and Sweden, is simply ignored.
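Returning to the substance of Appendix A: the rating scale model it is trying to present can be written directly in terms of parameters (standard notation, not the book's). For adjacent categories k - 1 and k of item i,

\[
  \ln\frac{\Pr\{X_{ni}=k\}}{\Pr\{X_{ni}=k-1\}} \;=\; \theta_n - \delta_i - \tau_k,
  \qquad k = 1, \dots, m,
\]

where the thresholds tau_1, ..., tau_m are shared by all items; the probability of any particular category then follows by normalizing over the m + 1 categories. Nothing in this formulation requires estimates before the model can be stated.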

An introductory text on modern measurement for substantive researchers should focus on measurement models rather than calculations of estimates, discuss assumptions but avoid derivations, use graphs and illustrations before presenting equations, be methodological rather than statistical, and use instructive examples. The book admirably meets the last expectation but fails on the others. From the very first pages, the term "model" is used to denote transformations of raw data. The probabilistic nature of modern measurement models is never revealed; the readers have to wait until the end of the book (Appendix A) before they are able to see that the Rasch model is an expression for a probability of success on an item. Before then they are not even shown a graph of a response function. The Rasch model is an interesting model in that it follows from a small set of assumptions. These assumptions are not presented in any way. Assumptions for the other models introduced in this book are also not presented.

In addition, an introductory textbook on measurement should make things simple by focusing on essences and avoiding discussions of superfluous technicalities. Attempts to simplify matters by distorting the theory, or worse still, making plain errors, have a contrary effect. A statement that is wrong can never be understood, no matter how many additional explanations and illustrations are offered.


This book has too many of those statements, and is therefore inaccessible to novices in measurement theory.

Finally, when reading this book, I became increasingly annoyed with all kinds of "unique" features ascribed to the Rasch model as well as the parochial way in which other models were treated. The Rasch model allows for consistent estimators of the parameters (provided we are willing to do what the authors systematically refrain from, namely to condition on sufficient statistics for the parameters) and has results that are easy to interpret. The model has these features because it belongs to the exponential family in statistics. These features are not unique and are shared with a host of other models that belong to the same family (including another model developed by Rasch himself, his Poisson model for reading errors). One of the other "unique" features ascribed to the Rasch model by the authors is its allowance for "linear" measurement. The definition of linear measurement that I have been able to extract from the book seems to be measurement on a linear scale. We know the mathematical existence of linear functions, equations, and operators. But what are linear scales? Is the Rasch model the only model with straight scales for its parameters, whereas all other models have curved scales? Another "unique" feature of the Rasch model seems to be "person-free item estimation" (p. 144). In chapter 10, this feature is explained as estimation of item difficulty independent of the distribution of the abilities in the sample of persons. At first sight, the Rasch model seems to share this feature with any other IRT model. All models formulate the probability of success on an item conditional on the ability of the examinee and are hence independent of the ability distribution. However, it is a statistical fact that the maximum-likelihood estimators used by the authors have sampling distributions that do depend on the ability parameters in the sample. These distributions thus entail a standard error of estimation for the estimators of the item parameters that is a function of the ability parameters. They even imply a (finite-sample) bias that is also a function of these parameters. Statements such as "person-free estimation" are thus statistically meaningless. The same feature is also referred to as "specific objectivity" (p. 140). In Rasch's original publications, this term has two different meanings: (a) additivity of the parameter structure, and (b) the existence of sufficient statistics for the parameters. However, the Rasch model shares the first feature with numerous other models (e.g., loglinear models, one-parameter normal-ogive model, regression and analysis of variance models without interaction terms), and the second with each family of distributions that meets the well-known factorization criterion in statistics.
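The sufficiency property at issue can be stated compactly (standard notation, not drawn from the book). Under the Rasch model, the likelihood of the responses x_{n1}, ..., x_{nk} of person n on k items factors as

\[
  L(\theta_n, b_1, \dots, b_k \mid x_n)
  \;=\;
  \frac{\exp\!\bigl(r_n \theta_n - \sum_{i} x_{ni} b_i\bigr)}
       {\prod_{i} \bigl[1 + \exp(\theta_n - b_i)\bigr]},
  \qquad r_n = \sum_{i} x_{ni},
\]

so the total score r_n is a sufficient statistic for theta_n. Conditioning on r_n yields a likelihood for the item parameters that no longer involves the ability parameters, which is the precise, statistically meaningful sense in which item estimation can be made independent of the persons sampled; it is exactly this conditioning that the authors refrain from.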

I have difficulty recommending this book as an introductory text to modern measurement. Readers will be much better off with a balanced, elementary text such as Hambleton, Swaminathan, and Rogers (1991). If they want to focus solely on the Rasch model, my preference would be Wright and Stone (1979), or, better still, the introductory chapter and chapters 5 and 6 in the original text by Rasch (1960).


REFERENCES

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danmarks Paedagogiske Institut.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.