Standards for Talking and Thinking About Validity

Paul E. Newton
Institute of Education, University of London

Stuart D. Shaw
Cambridge International Examinations, Cambridge, England

Standards for talking and thinking about validity have been promulgated in North America for decades. In 1954 two foundational standards were announced: (a) Thou shalt not refer to “the validity of the test” and (b) thou shalt use validity modifier labels, such as “content validity” or “predictive validity.” Subsequently, in 1985, the latter became, thou shalt not use validity modifier labels. These standards for talking about validity have repeatedly been disregarded over the years. Possible reasons include intentional misuse, while upholding standards for thinking about validity; lack of awareness or misunderstanding of standards for thinking about validity; and genuine divergence from standards for thinking about validity. A historical analysis of disregard for these standards provides a basis for reappraising the concept of validity. We amassed a new body of evidence with which to challenge the frequently asserted claim that a general consensus exists over the meaning of validity. Indeed, the historical analysis provides reason to believe that prospects for achieving consensus over the meaning of validity are low. We recommend that the concept of validity be abandoned in favor of the more general, all-encompassing concept of quality, to be judged in relation to measurement aims, decision making aims, and broader policy aims, respectively.

Keywords: validity, quality, evaluation, validation, test, assessment

The notion of “standards” at the heart of this discussion is intended to capture the idea of consensus, within a community, concerning how its members ought to behave. Within scientific communities, standards are often expressed implicitly (i.e., as paradigms through which knowledge is constructed). Within professional communities, standards are often expressed explicitly (i.e., as codes of practice). Whether implicit or explicit, standards are fundamental to communities because they enable individuals to function collectively (i.e., to function as communities).

Since the 1950s, the American Psychological Association (APA), the American Educational Research Association (AERA), and the National Council on Measurement in Education (NCME) have collaborated in the development of standards for educational and psychological testing (known, hereafter, as successive editions of the Standards), with an intention “to promote the sound and ethical use of tests and to provide a basis for evaluating the quality of testing practices” (AERA, APA, & NCME, 1999, p. 1).

Each edition has contained a consensus statement on validity, which has evolved over time as the field has developed. Each new edition is the product of debate between many subcommittees, representing many subcommunities, and takes many years to develop. The Standards are respected internationally, and conceptions of validity presented in successive editions have been appropriated internationally. Increasingly, in recent years, writers have acknowledged substantial discrepancy between the principles of validity, embodied within these consensus statements, and validation practice evident from the wider literature (e.g., Cizek, Rosenberg, & Koons, 2008; Hogan & Agnello, 2004; Hubley & Zumbo, 1996; Jonson & Plake, 1998; Messick, 1988; Shepard, 1993; Wolming & Wikstrom, 2010). This raises an important question: If measurement specialists have genuinely reached consensus over the concept of validity, then why is there so little evidence of this in validation practice?

In the present article, we add to this literature by exploring an apparent disjunction between standards for talking about validity and how validity is actually talked about in the published literature (our use of “talking” includes written text). Our intention is to mark a subtle distinction between standards for talking about validity and standards for thinking about validity. As we will explain shortly, the Standards contain both specific standards for talking about validity and more general standards for thinking about validity. Standards for thinking about validity specify how it ought to be understood (i.e., the accepted meaning of the concept). Standards for talking about validity specify how it ought to be expressed or articulated. The latter clearly follow from the former. Indeed, the point of standards for talking about validity would seem to be to emphasize, or to underline, associated standards for thinking about validity. In short, scientists and professionals ought to talk properly about validity in order that they, and others, continue to think properly about validity.

This article was published Online First July 8, 2013.

Paul E. Newton, Department of Curriculum, Pedagogy and Assessment, Institute of Education, University of London, London, England; Stuart D. Shaw, Cambridge International Examinations, Cambridge, England.

We are very grateful to Cambridge Assessment (University of Cambridge Local Examinations Syndicate, which includes Cambridge International Examinations) for supporting the preparation of this article.

Correspondence concerning this article should be addressed to Paul E. Newton, Department of Curriculum, Pedagogy and Assessment, Institute of Education, University of London, 20 Bedford Way, London WC1H 0AL, England. E-mail: [email protected]


Psychological Methods, 2013, Vol. 18, No. 3, 301–319. © 2013 American Psychological Association. 1082-989X/13/$12.00. DOI: 10.1037/a0032969


The present article will focus upon two of the most fundamental standards for talking about validity, both derived directly from the Standards, which we refer to colloquially as

1. Thou shalt not refer to “the validity of the test” (TVOTT), that is, as though validity were a property of tests.

2. Thou shalt (not) use validity modifier labels (VMLs), that is, terms like content validity and predictive validity (there is a not in parentheses because it was promoted by the first three editions yet rejected by the fourth and fifth).

These two standards are intimately, albeit confusingly, intertwined. We will demonstrate how they have repeatedly been disregarded, providing a basis for reflecting upon the desirability and viability of standards for thinking and talking about validity. We will conclude from our historical analysis that prospects for reaching consensus over the meaning of validity are low. This is epitomized by the fact that the field has been unable to reach agreement over whether the concept of validity ought to embrace the evaluation of measurement aims alone; the evaluation of measurement and decision making aims; or the evaluation of measurement, decision making, and broader testing policy aims. Our recommendation, faced with this enduring lack of consensus, is to abandon the concept of validity in favor of the broader concept of quality, applicable less contentiously across the three principal evaluation foci just mentioned.

Validity Means Different Things to Different Communities

This article is concerned with standards for talking about validity, as a point of focus for the more general issue of standards for thinking about, that is, for conceptualizing, validity. It therefore concerns what is meant by validity within a particular community. The community at the heart of this discussion is an extremely broad one: the supracommunity of educational and psychological measurement (EPM). It embraces scientists with a remit for measurement within academic settings and professionals with a remit for measurement within practical settings. It includes experimental psychologists, clinical psychologists, educational psychologists, guidance counselors, test developers, personnel psychologists, test regulators, and many more. Implicitly, the reach of this supracommunity is even broader, because it ought to extend to anyone, academic or practitioner, who relies upon measurement in an educational or psychological context. This would, for example, include many experimental psychologists who would not specifically consider themselves to be measurement scholars. It might also extend to those within other fields of social science research, where similar kinds of measurement procedure are relied upon. Although the more general relevance of this thesis should be appreciated, the article is framed in terms of the explicit development of standards by the mainstream EPM supracommunity.

Specifying this focus is important because standards of validity for EPM differ significantly from standards of validity across other communities of practice. For instance, within the community of formal logicians, validity refers to deductive arguments, such that an argument is valid if and only if it is not possible for all its premises to be true when its conclusion is false. Validity is defined differently across communities as disparate as law (e.g., Austin, 1832/1995; Waluchow, 2009), economics (e.g., MacPhail, 1998), pattern recognition (e.g., Halkidi, Batistakis, & Vazirgiannis, 2002), genetic testing (e.g., Holtzman & Watson, 1997), and management (e.g., Markus & Robey, 1980), to name but a few.
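To make the formal-logic sense concrete, consider the textbook example of a deductively valid form, modus ponens (the example is ours, added for illustration, not drawn from the sources just cited):

\[
p \rightarrow q, \quad p \;\therefore\; q
\]

No assignment of truth values to p and q makes both premises true while the conclusion is false, so the argument form is valid whatever p and q assert. Validity in this sense attaches to the form of an argument, not to a measuring instrument.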

More confusingly, there are standards for validity within education and psychology that are not specific to measurement. This is to draw a distinction between validity for research and validity for measurement. The former is relevant whenever conclusions are to be drawn on the basis of research evidence. The latter is relevant only for conclusions that relate specifically to measurement. Validity for research has been theorized from both quantitative (e.g., Bracht & Glass, 1968; Campbell, 1957; Campbell & Stanley, 1966; Cook & Campbell, 1979) and qualitative (e.g., Kvale, 1995; Lather, 1986, 1993; Maxwell, 1992) perspectives.

Validity Standards

The end of the 19th century and the beginning of the 20th century witnessed a huge expansion in the science and practice of EPM, especially in the United States. Inevitably, questions were raised over the quality of some of the new developments, and committees exploring the need for greater standardization and control were established by the APA as early as 1895 (under Cattell) and 1906 (under Angell). As explained by Fernberger (1932), early attempts at control were largely unsuccessful.

Some years later, the Standardization Committee of the North American National Association of Directors of Educational Research surveyed its membership with the intention of establishing consensus on the kind of information that could demonstrate the superiority of one test over another. It provided tentative definitions for terms like scale, average, performance, and so on, and proposed a process for standardizing tests, which included the determination of both validity and reliability. It defined these operations thus:

Two of the most important types of problem in measurement are those connected with the determination of what a test measures, and of how consistently it measures. The first should be called the problem of validity, the second, the problem of reliability. (Buckingham et al., 1921, p. 80)

Thirty years later, the APA took the lead again, establishing a Committee on Test Standards, to be chaired by Lee J. Cronbach. As explained in an initial draft prepared for consultation, it was given a remit to prepare “an official statement of the profession” concerning standards of reporting information about tests (see APA, 1952, p. 461). Importantly, its final draft, published 2 years later, was prepared by a joint committee of the APA, the AERA, and the National Council on Measurements Used in Education (NCMUE; APA, AERA, & NCMUE, 1954). This represented the very first consensus statement on such matters from the EPM supracommunity. It included sections on dissemination, interpretation, validity, reliability, administration and scoring, scales and norms. The section on validity, within the first edition of the Standards, was by far the largest. It included an introductory text, within which validity was defined and explained (pp. 13–18), followed by 19 validity standards (pp. 18–28), the very first of which was a standard for talking about validity:


When validity is reported, the manual should indicate clearly what type of validity is referred to. The unqualified term “validity” should be avoided unless its meaning is clear from the context. (APA et al., 1954, pp. 18–19)

In subsequent comments, it went on to add that no manual should use a blanket statement like “this test is valid,” and this requirement was elaborated in the second edition:

C1.1. Statements in the manual about validity should refer to the validity of particular interpretations or of particular types of decision. ESSENTIAL [Comment: It is incorrect to use the unqualified phrase “the validity of the test.” No test is valid for all purposes or in all situations or for all groups of individuals.] (APA, AERA, & NCME, 1966, p. 15)

This brief comment appeared almost word for word in every subsequent edition of the Standards. Note that there was no hedging here: The standard was deemed essential and the expression “It is incorrect” was uncompromising.

It is important to appreciate that this single standard for talking about validity spawned two quite distinct conventions: one that remains a fundamental standard for talking about validity even to the present day (thou shalt not refer to TVOTT) and one that is now seen as a relic of former times (thou shalt use VMLs). Because the story of the latter is less well documented than the former, the convoluted history of the VML will be discussed below in some detail, followed by a briefer and more straightforward account of why it is still considered inappropriate to refer to TVOTT.

Thou Shalt (Not) Use VMLs

A generally accepted principle of EPM, evident since at least the first few decades of the 20th century, was the idea that scores from a single test might be interpreted in different ways when used for different purposes (see Newton, 2012a). As explained in the first edition of the Standards, a vocabulary test might be interpreted as a measure of “present vocabulary” in one context, to make one kind of decision, but in terms of “intellectual capacity” in another, to make a different kind (APA et al., 1954, p. 13). It was for precisely this reason that the very first standard for talking about validity insisted that test manuals should clearly mark distinctions between different types of validity. Four types of validity were proposed, which mapped onto four aims of testing, which involved four types of interpretation. The four aims of testing were (a) to determine how an individual would perform at present in a given universe of situations (content validity), (b) to predict an individual’s future performance on an external variable (predictive validity), (c) to estimate an individual’s present status on an external variable (concurrent validity), and (d) to infer the degree to which an individual possesses a trait (construct validity). Thus, for example, content validation would be required in order to defend an interpretation in terms of present vocabulary, whereas construct validation would be required in order to defend an interpretation in terms of intellectual capacity. The first validity standard therefore required the use of VMLs in order to make explicit the kind of interpretation that had been validated. Although VMLs had appeared frequently in the literature on EPM since at least the 1930s, this new usage was somewhat different and somewhat more significant, as we shall now explain. We shall discuss the use of VMLs, in relation to the Standards, within three phases.

1930s to 1953. The use of VMLs can be found in the literature as early as the 1930s. Watson and Forlano (1935), for instance, spoke of prima facie validity; Woody and others (1935) referred to curricular validity; and Richardson (1936) discussed differential validity. A decade later, the concept of face validity was considered in some depth by both Rulon (1946) and Mosier (1947).

Perhaps the first scholars to have used VMLs to deconstruct the concept of validity were Greene, Jorgensen, and Gerberich (1943), who distinguished between three kinds of validity: curricular validity, statistical validity, and psychological and logical validity. Guilford (1946) cut the validity cake somewhat differently, suggesting that it came in two kinds: factorial validity and practical validity. Cronbach (1949), in the first edition of his classic textbook, Essentials of Psychological Testing, distinguished two “basic approaches” based upon logical and empirical analysis. In that same edition, he referred not only to empirical validity and logical validity, but also to factorial validity and curricular validity.

It is worth noting that early classifications using the VML formulation tended not to draw a clear distinction between different kinds of validity and different approaches to validation. For instance, Greene et al. (1943) referred to their three categories as both “types of test validity” (p. 54) and “types of methods” (p. 55).

1954 to 1984. The use of VMLs was formalized through the work of the committee that developed the first edition of the Standards (APA et al., 1954). The committee identified four types of validity: predictive, concurrent, content, and construct. From the first to the second edition, predictive validity and concurrent validity were combined within a single category: criterion-related validity. The first three editions presented somewhat mixed messages concerning the nature of validity. All three referred both to “types” and to “aspects” when describing their VMLs; the former suggesting fairly sharp dividing lines, and the latter suggesting the converse. It seems fair to conclude, however, that the first three editions of the Standards were generally read to be describing types rather than aspects (see, e.g., Guion, 1980). This seems consistent with the idea that different approaches to validation were required for different kinds of interpretation.

1985 to present day. It was against this fragmented view of validity that Messick (1975) championed a revolution. He insisted that talking about different kinds of validity—and marking such distinctions through the use of VMLs—was extremely misleading and had the potential to impact adversely upon validation practice. As he explained in two influential articles (Messick, 1980, 1981), important distinctions might become blunted, meaning that superficially similar categories are confused (e.g., content validity and construct validity), leading to confusion in evidence gathering; uniqueness might become elevated, such that one kind of validity (e.g., content validity), or a small set, might be treated as the whole of validity; and differences in importance might be overlooked, especially the supporting role played by content and criterion concerns to construct validation.

The fourth edition of the Standards (AERA, APA, & NCME, 1985) was clearly influenced by Messick. It stated explicitly that validity was a unitary concept, and although it did not formulate the rejection of the VML as an explicit validity standard, its decision to refer to content-related, criterion-related, and construct-related evidence of validity established that standard implicitly.


Both the fourth and fifth editions distinguished clearly between aspects and types of validity. They accepted that different kinds of evidence illuminated different aspects of validity, but insisted that the different kinds of evidence were not linked to different types of validity because there was now only one type of validity (i.e., construct validity). This was the foundation of a new creed for the EPM supracommunity—modern (Unitarian) validity theory, we might say, as opposed to traditional (Trinitarian) validity theory.

The definition of validity in the fifth edition of the Standards was essentially an homage to Messick (1989). It reflected not only the depth and sophistication of his thesis, but also his occasional confusion (Newton, 2012a). It dropped the traditional three labels entirely and referred instead to evidence based upon test content, response processes, internal structure, relations to other variables, and consequences of testing (AERA et al., 1999). Its glossary noted that because all validity is essentially construct validity, even the modifier construct was now redundant. Thus, the use of VMLs was officially abandoned, and a fragmented conception of validity was officially replaced by a unified one.

In summary, during the early years, prior to 1954, there were no official statements concerning the use of VMLs. From 1954 to 1984, there were explicit standards for using VMLs, and these were exemplified in the introductory text of the first three editions of the Standards. The first edition divided validity into four types—but only four types—reflecting the four aims of testing. The second and third editions collapsed these into just three, which were deemed sufficient to cover the full range of possible interpretations of test scores. From 1985 the Standards recognized only one kind of validity, meaning that VMLs were officially rejected. As explained in the glossary of the fifth edition, the only VML with any remaining claim to legitimacy was construct validity, yet even this label was now superfluous.

Thou Shalt Not Refer to the Validity of the Test

The use of VMLs followed from the principle that conclusions concerning validity are never general but relate to specific interpretations. Thus, an interpretation of test scores in terms of present vocabulary might be valid, whereas an interpretation of the same test scores in terms of intellectual capacity might be invalid.

In the same way, it was accepted that different conclusions concerning validity might follow for different groups of individuals, or for different situations within which individuals or groups found themselves. In short, it is never the test that is to be judged valid or invalid, in a general sense, but the interpretation of test scores as measures of a specific attribute under specified conditions.

Although the use of VMLs was officially rejected in the mid-1980s, the principle from which it was originally derived remained intact. Thus, it remained a fundamental tenet of modern validity theory that validity related to the interpretation of test scores and not to the test itself. If results from a single test were to be interpreted in terms of different attributes, then each interpretation would need to be validated independently. What changed was the assumption that different approaches were required to validate different kinds of interpretation: Modern validity theory decreed that construct validation was required for all interpretations. In short, consensus over the inappropriateness of referring to TVOTT was never shaken.

The Importance of Consensus

The idea of the Standards as a consensus position was fundamental from the outset, and each new edition reaffirmed this principle. The explicit foci for consensus were, presumably, the standards themselves, although, by implication, it seems reasonable to conclude that consensus was also reached on the introductory text that accompanied each section, which elaborated points of principle from which the standards were derived. Although the fourth edition said of the introductory text that it should not be interpreted as imposing additional standards, it seems hard to avoid the conclusion that in promulgating a particular view of validity, standards for thinking about validity were established just as much by the introductory text as by the validity standards themselves. Note, for instance, the uncompromising style adopted by successive editions, illustrated in the opening sentences of successive validity sections:

Validity information indicates to the test user the degree to which the test is capable of achieving certain aims. (APA et al., 1954, p. 13)

Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. (AERA et al., 1999, p. 9)

There is no hint here that the views expressed might represent a tentative consensus or a compromise position, or even that there might be any doubt at all over their legitimacy. No such hints are to be found in other passages or other editions. The Standards therefore present explicit (practical) standards, including standards for talking about validity, prefaced by more implicit (conceptual) standards for thinking about validity.

Over the past couple of decades, the claim that there now exists a new consensus over the nature of validity—embodied in the fourth and, especially, the fifth edition—has repeatedly been asserted (e.g., Angoff, 1988; Cronbach, 1989; Downing, 2003; Dunnette, 1992; Kane, 2001; Shepard, 1993; Sireci, 2009). Moss (1995, p. 6) went so far as to describe this as “a close to universal consensus among validity theorists.” This alleged consensus reasserts the traditional principle that it is wrong to refer to TVOTT because tests are not the kind of thing that can be valid or invalid. It also asserts that there is now only one kind of validity, construct validity, which renders the use of VMLs inappropriate.

Validity Custom and Practice

In response to this assertion of consensus, we now present a new body of evidence on the way in which VMLs have been used over the years, particularly during the two key phases identified above (pre- and post-1985). It highlights a disjunction between standards and custom and practice, that is, between how VMLs ought to have featured in the literature of EPM and how they actually did. This is followed by a much shorter section that simply highlights the more widely acknowledged fact that members of this supracommunity are still wont to refer to TVOTT. Possible reasons for these disjunctions are considered subsequently.


The Proliferation of VMLs Prior to 1985

In addition to those VMLs mentioned above, a range of different kinds of validity had been proposed even before the publication of the first edition of the Standards. These included intrinsic validity (Gulliksen, 1950); internal validity and external validity (Guttman, 1950); synthetic validity, generalized validity, and situational validity (Lawshe, 1952). Others were proposed shortly afterward, including convergent validity and discriminant validity (see Campbell & Fiske, 1959); internal validity, substantive validity, structural validity, and external validity (see Loevinger, 1957); trait validity and nomological validity (see Campbell, 1960).

Although the Standards were never intended as a textbook, it still seems a little odd that the early proliferation of VMLs was not explicitly recognized in the 1954 edition, let alone the 1966 revision. Indeed, although new types of validity continued to be introduced in the wake of the first edition, none was incorporated in the second or the third. Admittedly, a footnote to the 1974 revision did, at least, allude to developments within the wider literature:

Many other terms have been used. Examples include synthetic validity, convergent validity, job-analytic validity, rational validity, and factorial validity. In general, such terms refer to specific procedures for evaluating validity rather than to new kinds of interpretative inferences. (APA, AERA, & NCME, 1974, p. 26)

This was not, strictly speaking, correct, though; for instance, even those VMLs listed in the footnote were not simply alternative procedures. Rational validity, for example, was more of an overarching category, akin to logical validity, with links to curricular validity and content validity. Then there were other well-known validities that were neither present on the list nor could properly be described as procedures, such as trait validity and nomological validity (Campbell, 1960), and incremental validity (Sechrest, 1963).

Inevitably, we would expect the publication of a statement that claimed to express an official statement of the professions to generate a certain amount of debate and divergent opinion within the wider literature. Not only did this occur, it resulted in the invention of a multiplicity of new VMLs. Cattell (1964, p. 7), for instance, bemoaned the “motley list of ‘validity’ terms” in the Standards. He claimed that in promulgating them, the committee had been unduly successful in establishing a professional consensus, given that the concept was still in its infancy. He argued that several existing uses were either unfruitful (e.g., construct validity) or superfluous (e.g., face validity, predictive validity, concurrent validity, content validity) in the sense of not being central to the concept of validity and better described using other terms. In search of a “more basic set of concepts,” he proposed a suite of new VMLs along three dimensions: concrete validity to concept validity, natural validity to artifactual validity, and direct validity to indirect validity. His proposals had little impact on the wider literature. Nor did the plethora of VML-based taxonomies that were to follow.

Cureton (1965), for instance, distinguished between three kinds of criterion validity: raw validity, the correlation between a predictor measure and a (sui generis) criterion measure; true validity, the correlation between a predictor measure and estimated true scores on a (constructed) criterion measure; and intrinsic validity, the correlation between estimated true scores on a predictor measure and estimated true scores on a criterion measure. Lord and Novick (1968) drew a distinction between empirical validity and theoretical validity: empirical validity referring to the degree of association between the focal measurement and some other observable measurement, and theoretical validity referring to the correlation of an observed variable with a theoretical construct or latent variable, of which construct validity was a special case. Carver (1974) contrasted psychometric validity, concerning the identification of cross-sectional differences between individuals, with edumetric validity, concerning the identification of longitudinal changes within individuals over time. Popham (1978) proposed three new types of validity for criterion-referenced tests: descriptive validity, the extent to which the test measures what its descriptive scheme contended that it measured; functional validity, the extent to which the test fulfilled its intended function; and domain-selection validity, the extent to which the behavioral domain was wisely chosen.
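Cureton's three coefficients can be stated compactly in classical test theory notation. The formulation below is our gloss on his verbal definitions, assuming the standard correction-for-attenuation identities, where X and Y are observed predictor and criterion scores, T_X and T_Y the corresponding true scores, and \(\rho_{XX'}\), \(\rho_{YY'}\) the two reliabilities:

\[
\text{raw validity} = r_{XY}, \qquad
\text{true validity} = r_{X T_Y} = \frac{r_{XY}}{\sqrt{\rho_{YY'}}}, \qquad
\text{intrinsic validity} = r_{T_X T_Y} = \frac{r_{XY}}{\sqrt{\rho_{XX'}\,\rho_{YY'}}}
\]

On this reading, each successive coefficient strips measurement error from one more side of the predictor-criterion relationship.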

Beyond these alternative VML-based taxonomies, many new types of validity were proposed in the period between 1954 and 1984: for instance, domain validity (Tryon, 1957a, 1957b), common sense validity (Shaw & Linden, 1964, from English & English, 1958), occupational validity (Bemis, 1968), cash validity (Dick & Hagerty, 1971), single-group validity (Boehm, 1972), consensual validity (McCrae, 1982, from Rosenberg, 1979), decision validity (Hambleton, 1980), intrinsic rational validity and performance validity (Ebel, 1983), and so on. Thus, despite very clear standards for talking about validity from 1954 to 1984—which recognized content validity, construct validity, and criterion-related validity but no other VMLs—a very large number of new VMLs came to be proposed. In fact, many of the “biggest hitters” of their day—Campbell, Loevinger, Cattell, Cureton, Lord, Novick, Carver, Popham, Tryon, Hambleton, Ebel, and many others too—contributed to this proliferation.

The Continued Proliferation of VMLs Following 1985

As we shall now demonstrate, the VML formulation continued to be used long after the fourth edition of the Standards had been published. In fact, new VML-based taxonomies and new VML types continued to be proposed too.

The continued use of VMLs in the wider literature. To investigate the use of VMLs in contemporary research reports, we analyzed titles of articles, from 22 journals within the field of EPM, that had been published between January 1, 2005, and December 31, 2010.¹ This involved using the Internet search engine attached to the official website of each journal, restricted to the specified period, with validity in the title field. Titles including VMLs were exported for subsequent analysis. The intention was simply to count how many VMLs appeared in titles of articles from those journals, published between 2005 and 2010. Occasionally, more than one appeared in the same title, whereby all occurrences were counted.

¹ Applied Measurement in Education; Applied Psychological Measurement; Assessment; Assessment and Evaluation in Higher Education; Assessment in Education: Principles, Policy and Practice; Educational and Psychological Measurement; Educational Assessment; Educational Assessment, Evaluation and Accountability; Educational Measurement: Issues and Practice; European Journal of Psychological Assessment; International Journal of Selection and Assessment; Journal of Applied Psychology; Journal of Educational Measurement; Journal of Personality Assessment; Journal of Psychoeducational Assessment; Language Assessment Quarterly; Language Testing; Measurement and Evaluation in Counseling and Development; Measurement in Physical Education and Exercise Science; Measurement: Interdisciplinary Research and Perspectives; Psychological Assessment; and Psychometrika.


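As a rough illustration of the counting step, the sketch below tallies the modifier that immediately precedes the word validity in a list of exported titles. It is a minimal reconstruction under stated assumptions, not the authors' actual tooling: the title list is invented, the exclusion sets merely echo the examples given in Footnote 4, and multiword labels such as incremental criterion-related would need extra handling.

```python
import re
from collections import Counter

# Illustrative exclusion sets (cf. Footnote 4): referent modifiers and
# simple relational modifiers were not counted as substantive VMLs.
REFERENT = {"test", "item", "score", "scale", "questionnaire", "instrument"}
RELATIONAL = {"comparative", "relative", "maximum", "initial"}

def count_vmls(titles):
    """Tally the word or hyphenated phrase immediately preceding 'validity'."""
    counts = Counter()
    for title in titles:
        # Every occurrence within a title is counted, not just the first.
        for match in re.finditer(r"([\w-]+)\s+validity", title.lower()):
            label = match.group(1)
            if label not in REFERENT and label not in RELATIONAL:
                counts[label] += 1
    return counts

titles = [  # invented titles, purely for illustration
    "Incremental Validity of a Hypothetical Screening Measure",
    "Construct Validity and Predictive Validity of an Admissions Test",
]
print(count_vmls(titles))
# Counter({'incremental': 1, 'construct': 1, 'predictive': 1})
```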

As a point of reference, it is useful to note that there were 131 titles that referred to validity but without any VML and a further 40 that referred to the validity without any VML.

Table 1 presents results for 32 VMLs that appeared in titles from the 22 measurement journals between 2005 and 2010. For five journals, no titles including VMLs were identified.² For seven journals, more than 20 titles including VMLs were identified.³ A total of 208 uses were identified, or 144 if construct and construct-related are omitted as allowable VMLs. Two kinds of VML were excluded from this analysis: referent modifiers and simple relational modifiers.⁴

In the top 13 (most frequently observed VMLs), each with at least two uses, we note one still officially sanctioned VML, construct(-related); four ex-officially sanctioned VMLs, criterion(-related), predictive, concurrent, content; five very well worn VMLs, incremental, convergent, discriminant, factorial, structural; and one new contender, consequential. Outside the top 13, we note a host of more obscure VMLs: some as old as the hills, such as differential, internal, synthetic; others perhaps even making their maiden voyage, such as extratest, operational, elemental.

It is tricky to interpret the significance of these results in isolation. So the same analysis was run for the period between January 1, 1975, and December 31, 1980. This period captured articles that would have been written during the pre-Unitarian phase; before Messick (1980) argued that the use of VMLs should be dropped; and before the Standards changed its nomenclature for validity. Unfortunately, there were far fewer measurement journals published back then, so the analysis was restricted to just three.⁵

The comparison of VML prevalence between 1975–1980 and 2005–2010 was complicated by the fact that validity was referred to more frequently in titles from the earlier years within these three journals. Thus, from 1975–1980, 56 articles referred to validity without mentioning a VML, compared to 31 from 2005–2010; likewise, an additional 41 referred to the validity without a VML from 1975–1980, compared to seven from 2005–2010. Results for each of the three journals are presented in Table 2.

For the journal Educational and Psychological Measurement, the picture seems to be one of a reduction in the use of VMLs over time, from 86 to 24. The picture is less clear for the Journal of Applied Psychology and the Journal of Personality Assessment, however, with the Journal of Applied Psychology remaining fairly stable (14 vs. 11) and the Journal of Personality Assessment rising (19 vs. 28). If construct validity and construct-related validity are excluded from these figures, they become 78 versus 14 (Educational and Psychological Measurement), 12 versus 10 (Journal of Applied Psychology), and 12 versus 17 (Journal of Personality Assessment).

In summary, although there may be some indication of a possible reduction in the use of VMLs over time, this evidence is not overwhelming, and many can still be found gracing the pages of the most respected measurement journals. Once again, it is important to remember that these figures do not relate to the number of published articles that referred to VMLs; they relate simply to the number of articles with VMLs in their title. So the figures are a very conservative estimate of prevalence.

² Applied Psychological Measurement, Educational Measurement: Issues and Practice, Journal of Educational Measurement, Psychometrika, and Language Assessment Quarterly.

³ Psychological Assessment, Journal of Personality Assessment, Assessment, Educational and Psychological Measurement, International Journal of Selection and Assessment, Journal of Psychoeducational Assessment, and European Journal of Psychological Assessment.

⁴ Certain VMLs, which we have called simple relational modifiers, are typically used simply to indicate the comparison of validities rather than to identify a particular way of thinking about validity. They include comparative, relative, maximum, initial, etc. The distinction between simple relational validities and more substantive ones was not always clear-cut: differential validity, for instance, is sometimes used in a simple relational way, but is frequently used in a more substantive manner; incremental validity has a relational component, but seemed sufficiently substantive to be included. Other VMLs, which we have called referent modifiers, relate simply to the referent of the validity claim. They include test, item, score, scale, assessment center, interviewer, questionnaire, instrument, measurement, argument, etc. (Item validity is sometimes used in a nonreferent way to capture the extent to which intended cognitive processes are elicited by an item.)

⁵ Educational and Psychological Measurement, Journal of Applied Psychology, and Journal of Personality Assessment.

Table 1
The Prevalence of Validity Modifier Labels Within Recent Journal Articles

Label                             Frequency (n = 208)      %
Construct                                  61            29.3
Incremental                                27            13.0
Predictive                                 22            10.6
Convergent                                 17             8.2
Discriminant                               14             6.7
Criterion-related                          12             5.8
Concurrent                                  9             4.3
Criterion                                   9             4.3
Factorial                                   8             3.8
Construct-related                           3             1.4
Structural                                  3             1.4
Content                                     2             1.0
Consequential                               2             1.0
Differential                                1             0.5
Internal                                    1             0.5
Cross-cultural                              1             0.5
Cross-                                      1             0.5
External                                    1             0.5
Population                                  1             0.5
Consensual                                  1             0.5
Diagnostic                                  1             0.5
Extratest                                   1             0.5
Incremental criterion-related               1             0.5
Operational                                 1             0.5
Local                                       1             0.5
Concurrent criterion-related                1             0.5
Criteria                                    1             0.5
Cross-age                                   1             0.5
Elemental                                   1             0.5
Predictive criterion-related                1             0.5
Synthetic                                   1             0.5
Treatment                                   1             0.5


The continued proliferation of new VMLs. Not only do VMLs continue to be used repeatedly in research reports, new VMLs are continuously being invented. An unstructured survey of the wider literature, which sought to identify as many new VMLs as possible within the field of EPM, identified a whole host.⁶ They appeared within new VML-based taxonomies and as free-standing additions to the literature. They included general validity, specific validity (Tenopyr, 1986); representational validity, elaborative validity (Foster & Cone, 1995); prospective validity, retrospective validity (Jolliffe et al., 2003); formative validity, summative validity (Allen, 2004); site-validity, system-validity (Freebody & Wyatt-Smith, 2004); design validity, interpretive validity (Briggs, 2004); diagnostic validity (Willcutt & Carlson, 2005); translation validity (Trochim, 2006); structural validity, elemental validity (Hill, Dean, & Gaffney, 2007); cognitive validity, context validity, scoring validity (Shaw & Weir, 2007); manifest validity, semantic validity (Larsen, Nevo, & Rich, 2008); operational validity (Lievens, Buyse, & Sackett, 2008); extratest validity (Hopwood, Baker, & Morey, 2008); decision validity (Brookhart, 2009); cross-age validity (Karelitz, Parrish, Yamada, & Wilson, 2010); retrospective validity (Evers, Sijtsma, Lucassen, & Meijer, 2010); generic validity, psychometric validity, and relational validity (Guion, 2011).

The Validity of the Test

Increasingly, in recent years, writers have expressed sentiments ranging from embarrassment to exasperation that measurement specialists continue routinely to disregard the original validity standard. For example, in a presidential address to the NCME, Frisbie (2005) lamented that validity continued to be the most misunderstood or widely misused of all terms, consistently being used in ways that contradicted the consensual understanding. He quoted numerous examples from the literature of authors using phrases like the test will be valid or the validity of the test or test validity. Frisbie was not the first, nor the last, to have made this observation. Twenty years earlier, Lawshe (1985) had observed essentially the same thing.

⁶ The survey only counted VMLs that had been published in “respectable” measurement books or journals. For instance, it did not count terms like intentional validity, observation validity, and representation validity, which had been found on the Internet but could not be traced to a traditional publication.

Table 2
Comparative Prevalence of Validity Modifier Labels Over Time

                                              1975–1980                                        2005–2010
                               EdPM       JPA        JAP        Total        EdPM       JPA        JAP        Total
                               (n = 86)   (n = 19)   (n = 14)   (n = 119)    (n = 24)   (n = 28)   (n = 11)   (n = 63)
Label                           n     %    n     %    n     %    n      %     n     %    n     %    n     %    n     %
Construct                       8   9.3    7  36.8    2  14.3   17   14.3     9  37.5   11  39.3    0   0.0   20  31.7
Predictive                     27  31.4    1   5.3    0   0.0   28   23.5     3  12.5    1   3.6    3  27.3    7  11.1
Incremental                     2   2.3    0   0.0    0   0.0    2    1.7     1   4.2    4  14.3    1   9.1    6   9.5
Convergent                      3   3.5    1   5.3    0   0.0    4    3.4     1   4.2    2   7.1    1   9.1    4   6.3
Criterion                       0   0.0    0   0.0    0   0.0    0    0.0     0   0.0    4  14.3    0   0.0    4   6.3
Concurrent                     12  14.0    1   5.3    0   0.0   13   10.9     1   4.2    2   7.1    0   0.0    3   4.8
Discriminant                    7   8.1    3  15.8    0   0.0   10    8.4     2   8.3    0   0.0    1   9.1    3   4.8
Criterion-related               1   1.2    0   0.0    0   0.0    1    0.8     1   4.2    1   3.6    1   9.1    3   4.8
Construct-related               0   0.0    0   0.0    0   0.0    0    0.0     1   4.2    0   0.0    1   9.1    2   3.2
Factorial                      16  18.6    0   0.0    0   0.0   16   13.4     1   4.2    0   0.0    0   0.0    1   1.6
Internal                        1   1.2    0   0.0    0   0.0    1    0.8     1   4.2    0   0.0    0   0.0    1   1.6
Cross-cultural                  0   0.0    1   5.3    0   0.0    1    0.8     0   0.0    1   3.6    0   0.0    1   1.6
Cross-                          0   0.0    0   0.0    0   0.0    0    0.0     1   4.2    0   0.0    0   0.0    1   1.6
External                        0   0.0    0   0.0    0   0.0    0    0.0     1   4.2    0   0.0    0   0.0    1   1.6
Population                      0   0.0    0   0.0    0   0.0    0    0.0     1   4.2    0   0.0    0   0.0    1   1.6
Consensual                      0   0.0    0   0.0    0   0.0    0    0.0     0   0.0    1   3.6    0   0.0    1   1.6
Extratest                       0   0.0    0   0.0    0   0.0    0    0.0     0   0.0    1   3.6    0   0.0    1   1.6
Incremental criterion-related   0   0.0    0   0.0    0   0.0    0    0.0     0   0.0    0   0.0    1   9.1    1   1.6
Operational                     0   0.0    0   0.0    0   0.0    0    0.0     0   0.0    0   0.0    1   9.1    1   1.6
Local                           0   0.0    0   0.0    0   0.0    0    0.0     0   0.0    0   0.0    1   9.1    1   1.6
Differential                    0   0.0    0   0.0    7  50.0    7    5.9     0   0.0    0   0.0    0   0.0    0   0.0
Content                         2   2.3    1   5.3    1   7.1    4    3.4     0   0.0    0   0.0    0   0.0    0   0.0
Domain                          3   3.5    0   0.0    0   0.0    3    2.5     0   0.0    0   0.0    0   0.0    0   0.0
Single-group                    0   0.0    0   0.0    3  21.4    3    2.5     0   0.0    0   0.0    0   0.0    0   0.0
Diagnostic                      0   0.0    1   5.3    0   0.0    1    0.8     0   0.0    0   0.0    0   0.0    0   0.0
Concurrent criterion            1   1.2    0   0.0    0   0.0    1    0.8     0   0.0    0   0.0    0   0.0    0   0.0
Congruent                       1   1.2    0   0.0    0   0.0    1    0.8     0   0.0    0   0.0    0   0.0    0   0.0
Discriminative                  0   0.0    1   5.3    0   0.0    1    0.8     0   0.0    0   0.0    0   0.0    0   0.0
Edumetric                       1   1.2    0   0.0    0   0.0    1    0.8     0   0.0    0   0.0    0   0.0    0   0.0
Empirical                       1   1.2    0   0.0    0   0.0    1    0.8     0   0.0    0   0.0    0   0.0    0   0.0
Face                            0   0.0    1   5.3    0   0.0    1    0.8     0   0.0    0   0.0    0   0.0    0   0.0
Interpretative                  0   0.0    1   5.3    0   0.0    1    0.8     0   0.0    0   0.0    0   0.0    0   0.0
Job component                   0   0.0    0   0.0    1   7.1    1    0.8     0   0.0    0   0.0    0   0.0    0   0.0

Note. The table displays frequency of occurrence. EdPM = Educational and Psychological Measurement; JPA = Journal of Personality Assessment; JAP = Journal of Applied Psychology.


In an analysis of reviews from the 16th Mental Measurements Yearbook, published in 2005, Cizek et al. (2008) judged that 30% of all reviews referred to validity as a property of a test (cf. a score, inference, or interpretation), which corresponded to 55% of reviews that could be classified definitively.

Our own research into the prevalence of VMLs noted the use of the validity in titles of articles published between 1975–1980 and 2005–2010, with some indication of greater frequency of usage in the earlier period. We also observed the use of (what we termed) referent modifiers; not just test validity, but item validity, score validity, scale validity, assessment center validity, interviewer validity, questionnaire validity, instrument validity, and measurement validity. All of these uses would appear to be out of kilter with the claim that tests are not the kind of thing that can be valid or invalid.

To provide a little bit more insight into how valid is used within journal articles, we conducted a simple case study, based upon abstracts published in a leading journal of the field, Educational and Psychological Measurement. The online abstracts of articles published within two distinct periods were searched for the occurrence of the term valid. Each occurrence was coded, in terms of what, exactly, was being referred to as valid.⁷ Results are presented in Table 3.
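The coding step can be illustrated with a minimal sketch. The categories below follow the referent groupings reported alongside Table 3, but the keyword lists and the matching rule are our own simplifications rather than the authors' protocol; following Footnote 7, only the first occurrence of valid in an abstract is coded.

```python
import re

# Referent categories from Table 3; the keyword lists are illustrative guesses.
CATEGORIES = {
    "measurement": ["measurement"],
    "measure": ["measure"],
    "instrument": ["instrument", "test", "subtest", "scale"],
    "scores": ["score", "result", "data"],
    "predictor": ["predictor", "prediction"],
}

def code_referent(abstract):
    """Code the first occurrence of 'valid' by the phrase it modifies."""
    # \bvalid\b avoids matching 'validity'; capture the next word or two.
    match = re.search(r"\bvalid\b\s+(\w+(?:\s+\w+)?)", abstract.lower())
    if match is None:
        return None  # the abstract never uses the bare term 'valid'
    phrase = match.group(1)
    for category, keywords in CATEGORIES.items():
        if any(phrase.startswith(keyword) for keyword in keywords):
            return category
    return "other"

print(code_referent("The scale appears to be a valid predictor of later attainment."))
# -> predictor
```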

It is interesting to note that none of the 90 occurrences referred to a valid interpretation and only one referred to a valid inference. A substantial number of occurrences were coded as valid instrument (including instrument, test, subtest, test form, scale), and even more were coded as valid measure (including measure, measurement approach). The latter often read as though it were tantamount to valid test. The trend toward reference to valid measurement and valid scores (including scores, results, data) may, perhaps, hint at somewhat greater avoidance of reference to test validity during the later period.

Of particular interest was the high prevalence of valid predictor (including predictor, predictions, criterion estimates) during both periods. May we refer to a valid predictor? Is it implicitly rejected by the Standards in the same way as a valid test? One would have thought so, although we have never noticed the claim stated explicitly. Equally, we would assume that talk of a valid item is dismissed, despite the concept of item validity having a pedigree in EPM that dates back long before the first edition was penned (e.g., Lindquist, 1936).

Explaining the Disjunction Between Standards and Custom and Practice

There are at least three major categories of explanation for the disjunction between standards for talking about validity and how validity is actually talked about in the published literature:

• intentional misuse—understanding the consensus conception, and accepting it, but choosing to use nonconsensus language (i.e., choosing to disregard standards for talking about validity but not standards for thinking about validity);

• lack of awareness or misunderstanding—not understanding the consensus conception, and using nonconsensus language (i.e., not intentionally disregarding standards for talking about or thinking about validity); and

• genuine divergence—understanding the consensus conception, but rejecting it, and choosing to use nonconsensus language (i.e., choosing to disregard standards for talking about and thinking about validity).

These three categories provide a useful structure for reflecting upon the evidence presented above; so, for each standard in turn, we shall illustrate each category of explanation.

Thou Shalt Not Refer to the Validity of the Test

We begin by illustrating the range of reasons that we have encountered for referring to validity as though it were a property of tests.

Intentional misuse. It is not uncommon for writers to note, apologetically, that although they fully accept the consensus position on validity, they will lapse into loose talk because it is easier or more comfortable to do so: for example, “We sometimes speak of the ‘validity of a test’ for the sake of convenience, but it is more correct to speak of the validity of the interpretation and use to be made of the results” (Miller, Linn, & Gronlund, 2009, p. 72). Many authors agree that when measurement specialists refer to TVOTT they often do so “elliptically” (Kane, 2009, p. 40), or as “shorthand” (Guion, 2009, p. 467; Landy, 1986, p. 1186; Zumbo, 2009, p. 67), or merely as “a matter of convenience” (Reynolds, Livingston, & Willson, 2010, p. 124). In such cases, so the argument goes, there is no rejection of standards for thinking about validity, only of standards for talking about validity.

Lack of awareness or misunderstanding. A more worrying explanation of why measurement specialists still so frequently refer to TVOTT is that they are ignorant of validity standards, or fail to understand the principles underlying them. Although there have been no systematic investigations into this possibility, many suspect it to be true (e.g., Hubley & Zumbo, 1996). Commenting more narrowly upon the state of personnel psychology, Guion (2009) suggested, in exasperation, that even the traditional notion of validity was still not yet understood, let alone the modern one. He wondered whether this was because members of his profession had simply not studied the relevant literature.

Footnote 7: The research was originally intended to involve two 20-year periods: beginning of 1993 to end of 2012 and beginning of 1961 to end of 1980. In fact, the online abstracts for Educational and Psychological Measurement are only stored electronically, and therefore only searchable, back to 1974. As it happened, the period from 1974 to 1980 returned more instances of valid than the period from 1993 to 2012, so this sufficed for rough comparative purposes. During the later period, no abstract contained the term valid more than once. During the earlier period, it was not uncommon for valid to appear more than once in a single abstract. Where there was more than one occurrence, only the first was coded. The coding was very straightforward and almost always unambiguous.

Table 3
Referents of the Term Valid Within Educational and Psychological Measurement

                 January 1974 to           January 1993 to
                 January 1980              November 2012
Referent         Frequency (n = 49)   %    Frequency (n = 41)   %
Indicator         4                  8.2    0                  0.0
Instrument       12                 24.5    5                 12.2
Measure          10                 20.4    9                 22.0
Measurement       0                  0.0    5                 12.2
Predictor        14                 28.6    6                 14.6
Scores            1                  2.0    9                 22.0
Other             8                 16.3    7                 17.1


Genuine divergence. It is important to acknowledge that some people who refer to TVOTT will do so intentionally, consistent with beliefs that depart from the consensus position. In articles reminiscent of the Hans Christian Andersen fable The Emperor's New Clothes, Borsboom, Mellenbergh, and van Heerden (2004) and Borsboom, Cramer, Kievit, Scholten, and Franic (2009) argued forcefully against the view of validity as a property of interpretations, claiming that validity is necessarily a property of tests. Even more controversially, they claimed that this is the de facto consensus view among measurement specialists: rejected by construct validity theorists, but embraced by "the rest of the inhabitants of the scientific world" (Borsboom et al., 2009, pp. 163–164). Since neither article cited the Standards, it is unclear whether Borsboom and colleagues appreciated that they were not simply challenging an informal consensus amongst modern-day construct validity theorists, but the official position of the EPM supracommunity since the Standards was first penned. Interestingly, in their wake, other dissenters have made similar views known, including Lissitz and Samuelsen (2007).

Immediate reflections. Oddly enough, it is technically possible to refer to TVOTT without disregarding the original validity standard. Note how each edition specified that it was incorrect to use the "unqualified" phrase TVOTT. Presumably, then, if the phrase is qualified, it ought to be acceptable to speak of test validity after all. Note the following, from the Standards and Messick (1989), respectively:

If the validity of the test can reasonably be expected to be different in subgroups which can be identified when the test is given, the manual should report the validity for each group separately or should report that no difference was found. (APA et al., 1954, p. 26)

First, a test that is valid for a job or task in one setting might be invalid (or have a different validity) for the same job or task in a different setting, which constitutes situational specificity per se. Second, a test that is valid for one job might be invalid (or have a different validity) for another job albeit in the same setting, which is better described as job specificity. (Messick, 1989, p. 82)

Although even Messick often used phrases like test validity without explicit qualification, these two examples are useful in highlighting the possibility that people who refer to TVOTT may do so with a clear, albeit largely implicit, presumption of qualification. It certainly seems that when Borsboom refers to TVOTT, he fully accepts that the test might be valid for one particular group of students while invalid for another, or valid for one interpretation and use of results yet invalid for another (see Borsboom, 2012; Borsboom & Mellenbergh, 2007, pp. 104–105).

More generously, still, if the term test is interpreted to mean measurement procedure, in its broadest sense—including instrument, administration procedure, scoring procedure, and intended interpretation—then this may dissolve the notion of divergence between the two camps entirely, rendering debate over the legitimacy of terms like test validity something of a red herring (Newton, 2012b). However, the debate is not quite so easily dissipated because there are actually two further standards for talking and thinking about validity lurking here:

3. Thou shalt use the term validity when evaluating decision making procedures (i.e., it is correct to speak of the validity of the use of test scores, as well as the validity of their interpretation).

4. Thou shalt use the term validity when evaluating impacts from measuring (i.e., it is correct to speak of the validity of the overall testing policy).

Scholars like Borsboom and Mellenbergh (2007) reject both of these standards, claiming that they reflect professional issues that ought to be described with different terms. Conversely, they propose, the concept of validity, and therefore talk about validity, ought to be restricted to scientific issues, that is, to issues of measurement (see also Cizek, 2012; Scriven, 2002). However, it is worth noting that some scholars have drawn precisely the same distinction between professional and scientific interests, yet have reached precisely the opposite conclusion, that is, that validity ought to be restricted to professional talk of decision making and not be used for scientific talk of measurement (e.g., Gaylord & Stunkel, 1954). Successive editions of the Standards have always discussed decision making as a part of validity, particularly as the focus for criterion-related validation. The inclusion of impacts, however, is a more recent and a more controversial addition. Even the fifth edition of the Standards is ambiguous on this matter. Newton (2012a) has argued that it upholds the third standard, but not necessarily the fourth.

Thou Shalt (Not) Use VMLs

We have not uncovered any explicit discussion on reasons for the proliferation of new VMLs beyond the few defined in the Standards (either between 1954 and 1984 or subsequently), so the following sections focus particularly upon explanations that have been offered for the continued use of traditional VMLs following their official rejection in 1985.

Intentional misuse. The use of VMLs has not been debated widely in the literature, although a number of arguments for and against have been proposed, particularly in relation to content validity. Yalow and Popham (1983), for instance, warned that relabeling the term might substantially reduce attention to content coverage within validation. Fifteen years later, Sireci (1998) suggested that this had indeed occurred. Shepard (1993) took a contrary view, however, believing that the replacement of the "x validity" formulation with "x-related evidence of validity" in the fourth edition of the Standards had failed to flag the important conceptual change sufficiently, noting the persistence of inappropriate conceptions even within measurement journals from the 1990s (see also Moss, 1995).

Despite having feet firmly rooted in modern validity theory, Sireci (1998) staunchly defended the continued use of the content validity label, arguing that (a) if validity is understood in the everyday sense of the logical grounding of a claim, then the VML formulation is still technically correct; (b) new terms for describing the family of issues and procedures fundamental to content-related evaluation—content relevance, content representation, and domain definition—fail to cohere as a group; and (c) the idea of content validity is far easier for nonpsychometric audiences to comprehend.


As it happens, each of Sireci's three reasons for continuing to use the term content validity might be challenged. First, reverting to an everyday conception is inconsistent with the attempt to specify a precise technical meaning for validity, specific to EPM, which the EPM communities have aspired to for the best part of a century. It is also fair to say that there are many everyday senses of validity, so deference to a particular one might be considered arbitrary. So the idea that common sense furnishes a satisfactory technical meaning, consistent with the use of the term content validity, seems problematic. Second, the family of issues and procedures fundamental to content-related evaluation could, quite straightforwardly, be grouped through the use of content rather than validity. So this argument, too, seems at least debatable. Third, the suggestion that content validity is easier for lay audiences to understand is presumably meant to imply that the traditional caricature of validity is easier to understand than the modern view. This may be true, but whether it is appropriate to continue promulgating a spurious view of validity is questionable, even when communicating with validity novices. Was it not the traditional oversimplification of validity that got us into trouble in the first place (see Dunnette & Borman, 1979)?

The more general claim that content validity is easy to understand is also questionable, in light of the very many different versions of content validity that are still in circulation, as the Internet bears testament to. The fact that Sireci (2007) failed to recognize the version defended by Lissitz and Samuelsen (2007) suggests that the idea of content validity is not quite as unproblematic as might be assumed. In short, there are certainly questions to be raised in response to a purely pragmatic defense of the continued use of VMLs.

Sireci is certainly not alone in claiming educational benefits from continuing to use VMLs. In their textbook on psychological testing, McIntire and Miller (2007) began their discussion of validity with reference to the modern view and explained the five sources of evidence from the fifth edition of the Standards. However, three subsequent chapters focused explicitly upon the traditional characterization from the second edition: content validity, in which they included face validity; criterion-related validity, including predictive validity and concurrent validity; and construct validity, including discriminant validity and convergent validity. They justified their traditional presentation on the basis that "a student would not be able to interpret more than 100 years of testing literature, including case law, without a strong understanding of the three traditional types of validity" (p. 224).

Lack of awareness or misunderstanding. Cizek et al. (2008) conducted one of the very small number of empirical studies into the appropriation of standards for talking about validity, based upon an analysis of reviews prepared for the 16th Mental Measurements Yearbook. They judged that only seven of the 283 reviews used language consistent with the modern view and that the most common convention was to use language consistent with the traditional view (i.e., making reference to types of validity). They noted an explanation that had been offered for a similar phenomenon, some years earlier, by Shepard (1993), that practicing psychometricians do not actually understand the theory that they claim to be applying. This explanation is consistent with an observation from Camara and Lane (2006) that, in many instances, practitioners may be unfamiliar with their professional standards and have little exposure to new developments in assessment during their graduate training.

Cizek et al. (2008) also offered a slightly different kind of explanation, that practitioners may fail to read any deep significance into the language that they use, such that the distinction between, say, content validity and content-related evidence of validity is not especially salient for them. Their knowledge of modern validity theory might not be entirely lacking, but they might still fail to appreciate why (or even that) using the language of traditional validity theory is problematic. In other words, they may have begun to appropriate the new standard for thinking about validity, without having appropriated the new standard for talking about validity, that is, the rejection of VMLs. Cizek et al. recommended a more aggressive promulgation of such standards in future years to overcome this challenge.

Genuine divergence. From inspection of the literature alone, it would be hard to tell whether those who simply used VMLs did so with a clear understanding of the Standards and, therefore, with an appreciation of how their use related to the consensus position. On the other hand, we might hope that those who ventured to invent new VMLs, who aspired to be validity scholars, would do so with at least some appreciation of the manner in which they were diverging from established standards for talking about validity. The following sections reflect upon the use and invention of VMLs during two phases: pre- and post-1985.

1954 to 1984. For 3 decades, the consensus of the EPM supracommunity remained essentially unchanged: There were basically just three kinds of validity—content, construct, and criterion—and evaluators should make explicit which of the three they were talking about whenever validity was to be claimed. Naturally, there were scholars who explicitly disagreed with the consensus position and who proposed new VMLs to correct it (e.g., Cattell, 1964). More interesting, though, were the scholars whose new VMLs were proposed in order to elaborate upon, rather than to challenge, the Standards (e.g., Campbell, 1960; Campbell & Fiske, 1959; Lawshe, 1952; Sechrest, 1963). These elaborations represented only minor divergence from the Standards, each implying that the four validities of the first edition failed to capture all the important distinctions.

The fact that the second edition, published in 1966, included no discussion of these proposed elaborations indicates that they made no substantive impact on the consensus position. In fact, not only did the second edition deem the taxonomy of the first edition to have captured all the important distinctions—that is, all the important "interpretative inferences" (APA et al., 1974, p. 26)—it actually reduced the number of validities from four to three. Despite this tacit rebuttal, new VMLs continued to be invented. Some of these diverged significantly from the consensus position (e.g., Carver, 1974; Lord & Novick, 1968); others could be seen more as elaboration (e.g., Popham, 1978). Incidentally, although the invention of wholly new types of validity during this period would appear, at least by implication, to represent genuine divergence from the Standards (e.g., Bemis, 1968; Boehm, 1972; Dick & Hagerty, 1971), they were not necessarily presented as such.

In summary, this early proliferation of VMLs seems to represent a groundswell of dissatisfaction with standards for thinking and talking about validity presented within the Standards; although it is fair to say that this was not always presented as explicit divergence or dissatisfaction. Even some of the most influential scholars of the day felt the need either to elaborate upon the consensus position or to challenge it, through the invention of new VMLs. Ironically, this dissatisfaction—in stark contrast to the position that was to be championed by Messick—seemed to argue for increased fragmentation, not unification.


1985 to present-day. The continued use of VMLs, following the official unification of validity theory in the fourth edition of the Standards, represents a disregard of standards for talking about validity. However, in the absence of further research, it would be impossible to determine confidently either the extent to which this represented intentional disregard or the extent to which it also represented a rejection of standards for thinking about validity. Nonetheless, the continued use of terms like predictive validity, content validity, incremental validity, and factorial validity in titles from prominent EPM journal articles does seem to hint at a certain amount of genuine divergence and dissatisfaction. All of these articles presumably passed through a review process, with their titles, at least, reviewed by the journal editor. If the continued use of traditional VMLs were more careless than intentional, or more casual than formal, we might expect this to have been picked up during the review process.

When it comes to the invention of new VMLs, the case is stronger still: these labels are proposed by people who claim to be validity scholars, so we would certainly hope that any divergence from the consensus position was intentional. However, if they were explicitly disregarding the Standards, we might also expect them to have commented upon this, and this was not always the case. The continued proliferation of new VMLs during this phase remains ironic, but even more so now, because it represents the extension of a trend toward increased fragmentation of the concept of validity against the backdrop of its official unification. It is odd that the peculiarity of this phenomenon seems not to have been widely discussed.

Immediate reflections. The most ironic new VML of recent years is consequential validity, a term that is now common in the literature. It has been the focus of much debate concerning the fourth standard for talking about validity mentioned earlier. The irony derives from the fact that the term continues to be attributed to Messick (e.g., Lissitz & Samuelsen, 2007, p. 445) despite the fact that it was Messick who wrote the definitive critique of VMLs (Messick, 1980). Messick was interested in consequential evidence of validity, but explicitly refrained from using the term consequential validity, for obvious reasons. This slip raises an interesting and important question: How is it that even scholars of validity occasionally fail to see the irony in attributing the term consequential validity to Messick (1989)? It is almost as though there were something inherently incorrigible concerning loose talk about validity, something that repeatedly defies any attempt to control it. It seems true for talk of test validity, and it seems true for the use of VMLs.

To be fair, though, the official rejection of VMLs is not without consequence. It leaves the supracommunity without a multitude of terms for identifying distinctive ideas within what has now become an extremely broad concept. Messick (1980) provided a list of terms that might be substituted for some of the most popular VMLs, but these have largely failed to transfer into mainstream discourse. No one else, to our knowledge, has either extended his list or offered an alternative one.

The official rejection of VMLs also leaves the supracommunity in a bizarre position whereby official standards dismiss the use of VMLs—so there are no longer any official definitions of content validity, predictive validity, factorial validity, etc.—yet these terms remain a feature of custom and practice (i.e., of everyday discourse between measurement specialists). Where, then, ought validity novices to turn in order to find out what their colleagues and peers are talking about?

The more new VMLs appear on the scene, the greater the potential for confusion within the supracommunity—let alone beyond it—especially in light of their removal from the Standards. To date, we have identified 122 discrete VMLs, each invented to capture some aspect or another of validity for measurement (see Table 4). We have identified another 35 that seem to be no more than synonyms for those presented in Table 4.

Within Table 4 are VMLs that express discriminable, but actually quite similar, concepts: for example, logical, rational, content, curricular, face, and context; empirical, practical, and criterion; local and situational; intrinsic and construct. Then there are VMLs that appear only once in Table 4 but that have been endowed with completely different meanings by different measurement scholars, such as decision validity (e.g., Hambleton, 1980, vs. Brookhart, 2009), differential validity (e.g., Richardson, 1936, vs. Linn, 1978), internal/external validity (e.g., Guttman, 1950, vs. Loevinger, 1957), intrinsic validity (e.g., Gulliksen, 1950, vs. Guilford, 1954, vs. Cureton, 1965), functional validity (e.g., Popham, 1978, vs. Cone, 1995), practical validity (e.g., Guilford, 1946, vs. Campbell, 1960), prospective validity (e.g., Jolliffe et al., 2003, vs. Hoffman & Davis, 1995), psychometric validity (e.g., Carver, 1974, vs. Guion, 2011), retrospective validity (e.g., Jolliffe et al., 2003, vs. Evers et al., 2010), semantic validity (e.g., Burns, 1995, vs. Larsen et al., 2008, vs. Hanlon et al., 2008), and structural validity (e.g., Hill et al., 2007, vs. Loevinger, 1957).

There are also VMLs for measurement that have a different meaning as VMLs for research, such as construct validity (e.g., Cronbach & Meehl, 1955, vs. Cook & Campbell, 1979, vs. Lather, 1986), internal/external validity (e.g., Guttman, 1950, vs. Campbell, 1957), descriptive validity (e.g., Popham, 1978, vs. Maxwell, 1992), relational validity (e.g., Guion, 2011, vs. Julnes, 2011), face validity (e.g., Guilford, 1946, vs. Lather, 1986), and interpretive validity (e.g., Briggs, 2004, vs. Maxwell, 1992).

Finally, the meanings of some of the oldest and most popular VMLs have multiplied steadily over time, to the point where it is almost impossible to say what their real meaning might once have been, such as construct validity (e.g., Bechtoldt, 1959; Borsboom et al., 2009; Kane, 2008; Loevinger, 1957; Maraun, Slaney, & Gabriel, 2009; Messick, 1992; Smith, 2005), content validity (e.g., Ebel, 1983; Fitzpatrick, 1983; Guion, 1977a, 1977b; Lennon, 1956; Messick, 1975; Murphy, 2009; Sireci, 1998; Yalow & Popham, 1983), and face validity (e.g., Guilford, 1946; Mosier, 1947; Nevo, 1985; Rulon, 1946). In fact, having reviewed the variety of definitions provided for face validity, content validity, and construct validity, respectively, Mosier (1947), Fitzpatrick (1983), and Guion (2011) each suggested that their respective terms be abandoned. Guion even observed that the term validity might have outlived its usefulness. Although face, content, and construct have probably had more meanings associated with them than any other VML, a similar story of ambiguity could also be told for many more, including factorial validity, differential validity, incremental validity, instructional validity, curricular validity, and so on.


It is worth mentioning in passing that many of the new VMLs of recent years have been quite insubstantial. For example, operational validity appeared only in the title of the article by Lievens et al. (2005) and seemed merely to imply validity within an operational setting. Similarly, occupational validity appeared only in the title of the article by Bemis (1968), referring to little more than validation in an occupational setting. Extratest validity appeared only in the title and abstract of Hopwood et al. (2008) and was not actually defined in the article. Likewise, other than in the title, elemental validity appeared only once in the article by Hill et al. (2007), and structural validity appeared only in the title. In short, the invention of a new VML makes for a snappy title but often conveys more style than substance.

Finally, it is tempting to speculate that there may be just a few basic kinds of validity, or categories of validity evidence, into which the vast majority of the VMLs that have been proposed over the years can straightforwardly be collapsed: perhaps the five categories of evidence from the 1999 Standards, or even the four kinds of validity from the 1954 Standards. There is certainly some truth in this speculation. For instance, the majority of VMLs that we identified were introduced as part of an explicit scheme for classifying aspects of validity. Almost all of these schemes highlighted contrasts very similar to those drawn in the original 1954 Standards; sometimes excluding traditional kinds/categories (e.g., excluding predictive), sometimes including new kinds/categories (e.g., including consequential), sometimes subdividing traditional kinds/categories (e.g., dividing construct into trait and nomological). The new VMLs were often introduced to foreground subtle, albeit important, differences in emphasis (e.g., cognitive compared with content), rather than radically different ways of thinking about validity.

When we attempted to classify our comprehensive list of VMLs in terms of "broad similarity" to the Trinitarian scheme, we found that the large majority overlapped significantly with the traditional three categories, with some spanning two or all three of them. A substantial minority, around 20%, did not fit comfortably into any (e.g., cash, common sense, cross-age), although it is fair to say that few of these feature heavily in the literature. Those VMLs that did not fit comfortably, but that do feature significantly in the literature, included incremental, procedural, and three that correspond to the consequences category from the 1999 Standards: systemic, washback, and consequential. These findings provided some informal support for the utility of the five-way classification of sources of evidence from the 1999 Standards; based, as they were, upon the three-way classification of sources of evidence from the 1985 edition, expanded to encapsulate response as well as content sampling, and to include consequences; this having, in turn, derived from the three kinds of validity within the 1966 edition.
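To illustrate the bookkeeping that such a classification involves, here is a minimal sketch in Python. The category assignments are illustrative guesses of ours for a handful of the labels in Table 4; the actual exercise covered all 122 labels and rested on expert judgment, not code.

```python
# A rough sketch (ours) of a "broad similarity" classification against the
# Trinitarian scheme. The assignments below are illustrative guesses for a
# few of the 122 labels in Table 4, not the authors' actual codings.
TRINITY = {"content", "criterion", "construct"}

vml_categories = {
    "curricular": {"content"},
    "predictive": {"criterion"},
    "concurrent": {"criterion"},
    "nomological": {"construct"},
    "convergent": {"construct"},
    "empirical-judgmental": {"content", "criterion"},  # spans two categories
    "cash": set(),          # fits none of the three comfortably
    "common sense": set(),  # fits none of the three comfortably
    "cross-age": set(),     # fits none of the three comfortably
}

# Sanity check: every assignment uses only the three traditional categories.
assert all(cats <= TRINITY for cats in vml_categories.values())

unclassified = sorted(v for v, cats in vml_categories.items() if not cats)
share = len(unclassified) / len(vml_categories)
print(f"{len(unclassified)} of {len(vml_categories)} labels ({share:.0%}) "
      f"fit none of the traditional categories: {unclassified}")
```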

Table 4
One Hundred and Twenty-Two Kinds of Validity for Measurement

Administrative    Descriptive           Instructional              Rational
Artifactual       Design                Internal test              Raw
Behavior domain   Diagnostic            Internal                   Relational
Cash              Differential          Interpretative             Relevant
Cluster domain    Direct                Interpretive               Representational
Cognitive         Discriminant          Intrinsic                  Response
Common sense      Discriminative        Intrinsic content          Retrospective
Concept           Domain                Intrinsic correlational    Sampling
Conceptual        Domain-selection      Intrinsic rational         Scientific
Concrete          Edumetric             Item                       Scoring
Concurrent        Elaborative           Job component              Self-defining
Concurrent true   Elemental             Judgmental                 Semantic
Congruent         Empirical             Linguistic                 Single-group
Consensual        Empirical-judgmental  Local                      Site
Consequential     Etiological           Logical                    Situational
Construct         External test         Longitudinal               Specific
Constructor       External              Lower-order                Structural
Content           Extratest             Manifest                   Substantive
Context           Face                  Natural                    Summative
Contextual        Factorial             Nomological                Symptom
Convergent        Fiat                  Occupational               Synthetic
Correlational     Forecast true         Operational                System
Criterion         Formative             Performance                Systemic
Cross-age         Functional            Practical                  Theoretical
Cross-cultural    General               Predictive                 Trait
Cross-sectional   Generalized           Predictor                  Translation
Cultural          Generic               Procedural                 Treatment
Curricular        Higher-order          Prospective                True
Decision          Incremental           Psychological and logical  User
Definitional      Indirect              Psychometric               Washback
Derived           Inferential

Note. A fully referenced list of all of the kinds of validity that are mentioned in this table is available from the first author.


The Desirability of Standards for Thinking and Talking About Validity

We end by reflecting upon the desirability and viability of standards for talking and thinking about validity.

Standards for Thinking About Validity

A central component of scientific practice is persuasion, that is, the attempt by one or more scientists to bring others around to their view. In this respect, consensus is the holy grail of science, and standards for thinking about validity therefore represent an appropriate ambition for EPM, even from a purely scientific perspective. From the perspective of the scientist, of course, these would be descriptive standards, not prescriptive ones. It would be the antithesis of science to require all scientists to work within a common paradigm.

What, then, of prescriptive standards, like those found in the Standards? These are more pragmatic, sometimes legalistic, devices. They specify principles of professional practice that help to establish the credibility of practitioners within a community and the credibility of the profession within society. Educationalists and psychologists around the world recognize the significance of prescriptive standards, through which to establish and defend the credibility of measurement practice.

The prescriptive professional standards of EPM in North America have built upon descriptive scientific standards. Thus, despite their pragmatic focus, successive editions of the Standards have sought to establish their credentials rationally, by being grounded in the scientific paradigms of their day, from the traditional construct validity of Cronbach and Meehl (1955) to the modern construct validity of Messick (1989). The introductory text to each of the successive validity chapters expressed the consensus judgment of the EPM professions concerning the descriptive standard of the day for thinking about validity. The Standards therefore effectively prescribe a particular way of thinking about validity as a rational foundation for measurement practice. The idea of consensus seems doubly important in establishing credibility: consensus, on the one hand, between scientific and professional conceptions of validity, and consensus, on the other, among EPM professionals in how they conceptualize validity.

One final point: Standards for thinking about validity would seem to be important if consensus is to be reached on how to define the many other technical characteristics through which EPM is to be evaluated (e.g., reliability, bias, fairness).

Standards for Talking About Validity

The conclusion that standards for thinking about validity are important does not entail an obligation upon measurement specialists to uphold corresponding standards for talking about validity. So why were such standards (prescriptive, no less) ever thought to be necessary? This only really makes sense against a backdrop of negative impacts arising from inappropriate talk. The context within which the first edition of the Standards was published exhibited these features. Claims to validity were misinterpreted as though, for instance, a single correlation coefficient could sanction the use of a test for any purpose under any condition. Standards for talking about validity were intended to help to rectify this by helping users to understand the conditional nature of any claim to validity. Now, assuming that there used to be a reasonable case for promoting standards for talking about validity, does the same remain true today?

The fact that the Standards rejected the use of VMLs in the mid-1980s suggests that standards for talking about validity continued to be important. It was recognized that talking as though there were different types of validity had led measurement specialists to think about validity as a fragmented concept, with consequent negative impacts upon validation (Dunnette & Borman, 1979; Messick, 1980). Frisbie (2005) insisted that similar negative impacts from disregarding standards for talking about validity continued to occur even into the 21st century, from poor testing practice to weak validation, to widespread miscommunication within and beyond the professions. In short, there are good pragmatic reasons to think that standards for talking about validity are desirable, to help clarify standards for thinking about validity and validation.

The Viability of Standards for Thinking and Talking About Validity

We began by asking why—if there is supposed to be a consensus over standards for talking and thinking about validity—the standards continue to be disregarded in practice. We presented new evidence to illustrate this phenomenon, and discussed possible reasons that included intentional misuse, lack of awareness or misunderstanding, and genuine divergence from the consensus. Our historical analysis demonstrated an enduring lack of consensus concerning standards for talking about validity. It is hard to reach any definitive conclusion concerning the extent to which the disregarding of standards for talking about validity represents deeper dissatisfaction with standards for thinking about validity. Although there does appear to be an element of genuine frustration with the terminology for marking important dimensions of quality in EPM, our general impression from reading the literature on validity theory is that there is little appetite for returning to a fragmented conception. We do, however, note substantial disagreement over how, and to what, the term validity ought to be applied, representing a fundamental lack of consensus over standards for thinking about validity. We end by highlighting four outstanding challenges and a strategy that might go some way toward ameliorating them.

The first two challenges relate to the use of VMLs, and are in tension. On the one hand, it is clear that VMLs not only continue to be used but continue to be invented. This inextinguishable desire to fragment would seem to be the antithesis of unification. Yet, whether dimensions of quality in EPM can be more effectively marked through an expansion of the official lexicon would seem to remain an open question. Clearly, if the move toward unification has meant a blurring of important distinctions, then it has made it harder to teach validity, harder to learn validity, and thus increased the risk of lack of awareness and misunderstanding.

On the other hand, it is clear that the rampant proliferation of VMLs has not served EPM well. The only VML-based taxonomies that have ever gained widespread respect are to be found in the first three editions of the Standards. And the individual VMLs that have been proposed by so many, over so many years, simply do not cohere as a substantive contribution to validity theory; not that they have ever been presented as such. In fact, they are downright confusing: There are so many of them; some use the same term for different meanings; some use different terms for the same meaning; some of them seem extremely trivial; and so on. In short, there seems to be a need for standards for talking about validity that are capable of marking all the important distinctions without being distracted by unimportant ones. We are certainly at liberty to ask whether the categories used in the fifth edition of the Standards are optimal in this respect, although, as we discussed earlier, they do seem to resonate, at least, with the vast majority of VMLs that have been proposed over the years.


The second two challenges relate to the TVOTT debate. As we explained earlier, there actually seems to be little disagreement over the principle that any claim to validity is conditional, that is, little debate over this standard for thinking about validity, only over the corresponding standard for talking about it. Those who insist on referring to TVOTT tend simply to presume conditionality and see no need to mark it discursively. The third challenge, therefore, is how we might resolve this standoff.

There is, however, a more substantial debate lurking below the surface, that is, genuine disagreement over the level (or levels) at which a claim to validity might be staked. Four such levels illustrate the spectrum of opinion:

• the elements of the measurement procedure (e.g., "the item is valid"),

• the measurement procedure (e.g., "the test is valid"),

• the decision procedure (e.g., "the use of the test is valid"), and

• the testing policy (e.g., "the system is valid").

At each of these levels, the purpose of declaring its subject valid is, in effect, to declare that its subject is fit to be used as one component of a higher level process: The item is fit to be used in the measurement procedure; the measurement procedure is fit to be used in the decision procedure; the decision procedure is fit to be used in the testing policy; the testing policy is fit to be used in the construction of a good society. With each new level, the claim to validity concerns different kinds of conclusion, derived from different kinds of evidence and analysis.
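For readers who find a structural picture helpful, the following minimal sketch (ours, purely illustrative) models the four levels as nested data types, so that a validity claim at one level reads as a fitness-for-use claim about a component of the level above.

```python
# A minimal, purely illustrative data model (ours, not the authors') of the
# four nested levels. A validity claim at one level amounts to declaring a
# component fit for use at the level above it.
from dataclasses import dataclass

@dataclass
class Item:
    """An element of the measurement procedure (e.g., a test question)."""
    prompt: str

@dataclass
class MeasurementProcedure:
    """'The test': items plus administration, scoring, and interpretation."""
    items: list[Item]

@dataclass
class DecisionProcedure:
    """'The use of the test': scores feeding a decision rule."""
    measurement: MeasurementProcedure

@dataclass
class TestingPolicy:
    """'The system': decision procedures serving broader policy aims."""
    decisions: list[DecisionProcedure]

# "The item is valid" then reads: fit for use in its MeasurementProcedure;
# "the test is valid": fit for use in its DecisionProcedure; and so on, up
# to the policy level, with different kinds of evidence relevant at each step.
```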

Since the mid-1950s, successive editions of the Standards have always adopted a fairly broad conception of validity, tailored ultimately to the intended use of test scores (i.e., to the decision procedure). In recent years, many have wanted to extend validity to the level of testing policy. There are two related problems here. First, some now believe that the concept of validity has become too global to be useful (e.g., Brennan, 1998). It is not just that validity has become very hard to grasp and to communicate. It is also that, as it has moved beyond traditional territories and boundaries, it has become increasingly tricky to operationalize. If validation is to include an evaluation of measurement aims, decision making aims, and broader testing policy aims, then who ought to coordinate evaluation on this scale, and who ought ultimately to be responsible for it? Concerns such as these, and others, have split the field into those who insist that validity ought to be considered a narrow, scientific concept (e.g., Borsboom et al., 2004; Cizek, 2012; Lissitz & Samuelsen, 2007; Maguire, Hattie, & Haig, 1994; Mehrens, 1997; Popham, 1997; Scriven, 2002) and those who insist that it ought to be considered a broad, scientific and ethical one (e.g., Cronbach, 1988; Kane, 2013; Linn, 1997; Messick, 1980; Shepard, 1997).

Second, the failure to restrict talk of validity to a particular level undermines any attempt to specify a precise technical meaning for validity within EPM. For example, many measurement specialists are quite happy to refer to the validity of decisions and interpretations and test scores and tests and questions, and so on (e.g., Pollitt, 2012). Yet, the more liberally we use the term, the less precise its meaning becomes.

A reviewer of the first draft of this article commented that the attempt to provide a precise technical definition of validity was in vain because it is a family resemblance concept; that is, there are clusters of features associated with all uses of the term validity, but no one feature that is common to all. This very helpfully gets to the crux of the matter, whether validity is, or could be fashioned into, a family resemblance concept. In terms of the current status of validity, there are three alternatives: Its meaning is captured by a precise technical definition; in the absence of a precise technical definition, its meaning is captured by unwritten rules that govern its application (i.e., it functions as a family resemblance concept); or it has no clear meaning and it tends to be used indiscriminately, arbitrarily, or in all sorts of different ways.

As we have seen, the North American EPM supracommunity has been trying to provide a precise technical definition of validity for the best part of a century. The first official definition—framed exclusively in terms of measurement quality—was fairly precise: the degree to which a test measures what it purports to measure. As this definition was expanded to include prediction, it became less precise. When the concept was officially fragmented into a small number of kinds, it came to elude definition. The proliferation of unofficial VMLs epitomized and exacerbated this tolerance of imprecision. Subsequently, the unification of validity encouraged us to embrace precision once again, reestablishing measurement quality (i.e., score meaning) as the essence of all validity (see Newton, 2012a). Nowadays, though, it is clear that the term is used in all sorts of different ways, many of which appear to conflict with the official consensus position. Indeed, in practice, there does not seem to be any consensus over the proper application of the term: Some say it applies simply to tests; others say it applies to interpretations, or even to systems; while others say that it applies to items, to testing policy, and to anything in between. Even the official consensus position itself is somewhat vague and confused (Newton, 2012a).

In summary, unlike the field of formal logic, where it has been possible to agree upon a precise technical definition for validity, it has not been possible to reach agreement in the field of EPM. More importantly, though, nor has it been possible to reach agreement upon its proper application in the absence of a precise technical definition (i.e., validity fails even to count as a family resemblance concept). The failure to reach agreement over a precise technical definition, despite a century of negotiation, suggests that it may not be a viable option. Yet, might it still be possible to negotiate meaning for validity as a family resemblance concept? This is the fourth and most fundamental challenge.

As a final aside, we briefly return to the challenge of conditionality. Recall that reference to TVOTT was dismissed because any claim to validity is conditional. To many, it seemed that referring to validity as a property of interpretations, not tests, provided a straightforward solution to this problem. Yet, this would only be true if interpretation were somehow immune to conditionality, or if conditionality were somehow built into interpretation. The former is clearly not true. Any interpretation (of test scores) will be conditional (e.g., upon whether the test had been administered properly, upon whom it was administered to, and so on). Likewise, the latter is simply not feasible. It would be impossible to identify each and every possible condition upon which the validity of the intended interpretation rested, either from a practical perspective or from a logical one.


On reflection, it seems that the threat of misunderstanding associated with TVOTT arises from the decision to declare anything valid or invalid, be that a test, an interpretation, or a testing policy. First, the grammar of the term invites an absolute interpretation, because it encourages us to think in terms of black and white, valid or invalid. Second, as soon as the term is applied to anything—for example, an element of a measurement procedure, a measurement procedure, a decision procedure, or a testing policy—the declaration of validity functions like a stamp of approval, a green light to proceed, or a license to practice. It declares, in a pretty absolute manner, fitness-for-purpose. As validity is declared, conditionality slips out of view.

Assuming that it may not be possible to agree upon a precise technical definition for validity, what are the prospects for fashioning it into a family resemblance concept by agreeing upon parameters for its application? The trend nowadays, even among many validity theorists, seems to be to apply the term fairly liberally, to interpretations and uses and even to impacts from testing that bear no relation to measurement or decision making. If, as we have just discussed, the real problem with talk of TVOTT is declaring anything valid (i.e., it is not solved by restricting the term validity to interpretations and uses), then reaching agreement upon a very broad use of the term—applicable to items, to testing policy, and to anything in between—appears to be far more viable. There could be much to recommend this strategy. Kane (2012) argued that embracing a broad conception of validity increases the likelihood that important evaluation concerns are not overlooked (see also Bennett, 2012). Pollitt (2012) argued that embracing a broad conception of validity provides us with the conceptual means to hold everyone involved in test development to account.

Agreeing upon such a permissive standard would certainly suggest that we had fashioned validity into a family resemblance concept. Indeed, it seems likely that we would thereby have created a concept very much like quality: quality of the item, quality of the test, quality of the testing policy. If so, then it would behoove us to consider whether the concept of validity actually captured anything beyond the concept of quality. Indeed, if the use of validity became indistinguishable from the use of quality—and it is hard to see what might distinguish the two concepts construed so liberally—then why would we retain the concept of validity at all?

The concept of quality is transparently liberal and has the advantage of having a general, nontechnical, commonsense meaning. It therefore invites interlocutors to clarify what they might mean by quality in the particular context of application. The concept of validity, by way of contrast, is quite opaque, having become obscured by a century of attempts to imbue it with precise technical meaning. Moreover, a legacy of having thus strived for precision is the risk that interlocutors will presume that the matter had now been settled, potentially discouraging them from clarifying what they might mean by validity in the particular context of application. Ultimately, if the way in which we chose to use the term validity rendered it tantamount to quality, then we would be well advised simply to talk of quality.

Usefully, the grammar of quality would help to discourage us from making unnecessary declarations of fitness-for-purpose, simply because it has no direct analogue for valid. How often do we ever really need to declare anything within the field of EPM either valid or invalid? In those rare instances when declaration is deemed to be essential, the use of alternative terms such as legitimate or defensible would carry less risk of conveying inappropriate surplus meaning. Declaring a procedure valid transforms validity into an all-or-nothing concept (Newton, 2012a), which many experts consider to be an inappropriate and harmful image (e.g., Markus, 2012; Pollitt, 2012). The less frequently we make such declarations, the less this image is promulgated. The grammar of quality helpfully discourages all-or-nothing thinking.

Referring to quality, instead of validity, might also help to extinguish a long-standing confusion between validity and reliability. Debate continues over how best to theorize the relationship between these two: as though they represent largely distinct characteristics (e.g., Cattell, 1964); as though they represent regions on a single continuum (e.g., Campbell & Fiske, 1959; Marcoulides, 2004; Thurstone, 1931); or as though one, reliability, is simply a dimension within the other (e.g., Cureton, 1951; Kane, 2004; Messick, 1998). Quality, on the other hand, naturally establishes itself as a superordinate category within which reliability might comfortably reside as a dimension, thereby helping to achieve the synthesis recommended by Cureton, Messick, and Kane.

The most important motivation for embracing the concept of quality, and abandoning the concept of validity, is based upon the empirical evidence amassed in preceding sections. Over a period that spans nearly a century, it has proved impossible to secure consensus over the meaning of validity; not even as a family resemblance concept. This was evident in the long-standing debate over reference to TVOTT, as well as in the rampant proliferation of VMLs even after they had officially been rejected. It is currently epitomized in the standoff between those who insist upon a narrow, scientific conception of validity and those who insist upon a broad, scientific and ethical one. We need now to take radical action to dissipate this tension. If we are to talk meaningfully and productively about the characteristics of quality in EPM, then we need to bypass the concept of validity. So why not just cut out the middleman and talk directly about quality? This is to recommend quality as the principal family resemblance concept for evaluation within EPM, applicable equally across the three principal foci of measurement, decision making, and testing policy.

What exactly might we mean by quality in different contexts? What are the important distinctions that we need to capture, or recapture, when theorizing evaluation within EPM? Some might be tempted to consider breathing new life into the traditional lexicon, introducing quality modifier labels like content quality, predictive quality, and factorial quality. As Sireci (1998, 2007) reminded us, the rejection of VMLs has made it harder to discuss some of the important characteristics of test quality, and this could go some way to rectifying the situation. However, the very act of breathing new life into the old labels would risk reifying those concepts, in much the same way as the traditional VML formulation did. Furthermore, the adoption of certain useful quality modifier labels might open the floodgates to many far less useful ones: to summative quality, occupational quality, site quality, extratest quality, and a whole host of other dubious distinctions that have probably done more to mystify the landscape of validity theory over the decades than to clarify it. We follow the spirit of modern validity theory in preferring to think of quality, within EPM, as more holistic than fragmented, guided by three principal evaluation foci: quality of measurement, quality of decision making, and quality of testing policy.


References

Allen, M. J. (2004). Assessing academic programs in higher education. Bolton, MA: Anker.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Author.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.

American Psychological Association. (1952). Technical recommendations for psychological tests and diagnostic techniques: Preliminary proposal. American Psychologist, 7, 461–475. doi:10.1037/h0056631

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1966). Standards for educational and psychological tests and manuals. Washington, DC: Author.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1974). Standards for educational and psychological tests. Washington, DC: Author.

American Psychological Association, American Educational Research Association, & National Council on Measurements Used in Education. (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51(2, Pt. 2), 1–38. doi:10.1037/h0053479

Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 19–32). Hillsdale, NJ: Erlbaum.

Austin, J. (1995). The province of jurisprudence determined (W. E. Rumble, Ed.). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511521546 (Original work published 1832)

Bechtoldt, H. P. (1959). Construct validity: A critique. American Psychologist, 14, 619–629. doi:10.1037/h0040359

Bemis, S. E. (1968). Occupational validity of the General Aptitude Test Battery. Journal of Applied Psychology, 52, 240–244. doi:10.1037/h0025733

Bennett, R. E. (2012). Consequences that cannot be avoided: A response to Paul Newton. Measurement: Interdisciplinary Research and Perspectives, 10, 30–32. doi:10.1080/15366367.2012.686865

Boehm, V. R. (1972). Negro–White differences in validity of employment and training selection procedures. Journal of Applied Psychology, 56, 33–39. doi:10.1037/h0032130

Borsboom, D. (2012). Whose consensus is it anyway? Scientific versus legalistic conceptions of validity. Measurement: Interdisciplinary Research and Perspectives, 10, 38–41. doi:10.1080/15366367.2012.681971

Borsboom, D., Cramer, A. O. J., Kievit, R. A., Scholten, A. Z., & Franic, S. (2009). The end of construct validity. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 135–170). Charlotte, NC: Information Age.

Borsboom, D., & Mellenbergh, G. J. (2007). Test validity in cognitiveassessment. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnosticassessment for education: Theory and applications (pp. 85–115).

New York, NY: Cambridge University Press. doi:10.1017/CBO9780511611186.004

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071. doi:10.1037/0033-295X.111.4.1061

Bracht, G. H., & Glass, G. V. (1968). The external validity of experiments. American Educational Research Journal, 5, 437–474. doi:10.3102/00028312005004437

Brennan, R. L. (1998). Misconceptions at the intersection of measurement theory and practice. Educational Measurement: Issues and Practice, 17, 5–9. doi:10.1111/j.1745-3992.1998.tb00615.x

Briggs, D. C. (2004). Comment: Making an argument for design validity before interpretive validity. Measurement: Interdisciplinary Research and Perspectives, 2, 171–174. doi:10.1207/s15366359mea0203_2

Brookhart, S. M. (2009). The many meanings of “multiple measures.” Educational Leadership, 67, 6–12.

Buckingham, B. R., McCall, W. A., Otis, A. S., Rugg, H. O., Trabue, M. R., & Courtis, S. A. (1921). Report of the Standardization Committee. Journal of Educational Research, 4, 78–80.

Burns, W. C. (1995). Content validity, face validity, and quantitative face validity. Retrieved from http://www.burns.com/wcbcontval.htm

Camara, W. J., & Lane, S. (2006). A historical perspective and current views on the Standards for Educational and Psychological Testing. Educational Measurement: Issues and Practice, 25, 35–41. doi:10.1111/j.1745-3992.2006.00066.x

Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297–312. doi:10.1037/h0040950

Campbell, D. T. (1960). Recommendations for APA test standards regarding construct, trait, or discriminant validity. American Psychologist, 15, 546–553. doi:10.1037/h0048255

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105. doi:10.1037/h0046016

Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago, IL: Rand McNally.

Carver, R. P. (1974). Two dimensions of tests: Psychometric and edumetric. American Psychologist, 29, 512–518. doi:10.1037/h0036782

Cattell, R. B. (1964). Validity and reliability: A proposed more basic set of concepts. Journal of Educational Psychology, 55, 1–22. doi:10.1037/h0046462

Cizek, G. J. (2012). Defining and distinguishing validity: Interpretations of score meaning and justification of test use. Psychological Methods, 17, 31–43. doi:10.1037/a0026975

Cizek, G. J., Rosenberg, S. L., & Koons, H. H. (2008). Sources of validity evidence for educational and psychological tests. Educational and Psychological Measurement, 68, 397–412. doi:10.1177/0013164407310130

Cone, J. D. (1995). Assessment practice standards. In S. C. Hayes, V. M. Follette, R. M. Dawes, & K. Grady (Eds.), Scientific standards for psychological practice: Issues and recommendations (pp. 201–224). Reno, NV: Context Press.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin.

Cronbach, L. J. (1949). Essentials of psychological testing. New York, NY: Harper.

Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Erlbaum.

Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory and public policy (pp. 147–171). Urbana: University of Illinois Press.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. doi:10.1037/h0040957

Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621–694). Washington, DC: American Council on Education.

Cureton, E. E. (1965). Reliability and validity: Basic assumptions and experimental designs. Educational and Psychological Measurement, 25, 327–346. doi:10.1177/001316446502500204

Dick, W., & Hagerty, N. (1971). Topics in measurement: Reliability and validity. New York, NY: McGraw-Hill.

Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education, 37, 830–837. doi:10.1046/j.1365-2923.2003.01594.x

Dunnette, M. D. (1992). It was nice to be there: Construct validity then and now. Human Performance, 5, 157–169. doi:10.1207/s15327043hup0501&2_9

Dunnette, M. D., & Borman, W. C. (1979). Personnel selection and classification systems. Annual Review of Psychology, 30, 477–525. doi:10.1146/annurev.ps.30.020179.002401

Ebel, R. L. (1983). The practical validation of tests of ability. Educational Measurement: Issues and Practice, 2, 7–10. doi:10.1111/j.1745-3992.1983.tb00688.x

English, H., & English, A. A. (1958). Comprehensive dictionary of psychological and psychoanalytical terms. New York, NY: Longmans, Green.

Evers, A., Sijtsma, K., Lucassen, W., & Meijer, R. R. (2010). The Dutch review process for evaluating the quality of psychological tests: History, procedure, and results. International Journal of Testing, 10, 295–317. doi:10.1080/15305058.2010.518325

Fernberger, S. W. (1932). The American Psychological Association: A historical summary, 1892–1930. Psychological Bulletin, 29, 1–89. doi:10.1037/h0075733

Fitzpatrick, A. R. (1983). The meaning of content validity. Applied Psychological Measurement, 7, 3–13. doi:10.1177/014662168300700102

Foster, S. L., & Cone, J. D. (1995). Validity issues in clinical assessment. Psychological Assessment, 7, 248–260. doi:10.1037/1040-3590.7.3.248

Freebody, P., & Wyatt-Smith, C. (2004). The assessment of literacy: Working the zone between “system” and “site” validity. Journal of Educational Enquiry, 5, 30–49.

Frisbie, D. A. (2005). Measurement 101: Some fundamentals revisited. Educational Measurement: Issues and Practice, 24, 21–28. doi:10.1111/j.1745-3992.2005.00016.x

Gaylord, R. H., & Stunkel, E. R. (1954). Validity and the criterion. Educational and Psychological Measurement, 14, 294–300. doi:10.1177/001316445401400209

Greene, H. A., Jorgensen, A. N., & Gerberich, J. R. (1943). Measurement and evaluation in the secondary school. New York, NY: Longmans, Green.

Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6, 427–438.

Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York, NY: McGraw-Hill.

Guion, R. M. (1977a). Content validity—The source of my discontent. Applied Psychological Measurement, 1, 1–10. doi:10.1177/014662167700100103

Guion, R. M. (1977b). Content validity: Three years of talk—What’s the action? Public Personnel Management, 6, 407–414.

Guion, R. M. (1980). On Trinitarian doctrines of validity. Professional Psychology, 11, 385–398. doi:10.1037/0735-7028.11.3.385

Guion, R. M. (2009). Was this trip really necessary? Industrial and Organizational Psychology, 2, 465–468. doi:10.1111/j.1754-9434.2009.01174.x

Guion, R. M. (2011). Assessment, measurement, and prediction for personnel decisions (2nd ed.). Hove, England: Routledge.

Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5, 511–517. doi:10.1037/h0054604

Guttman, L. (1950). The problem of attitude and opinion measurement. In S. A. Stouffer et al. (Eds.), Studies in social psychology in World War II: Vol. 4. Measurement and prediction (pp. 46–59). Princeton, NJ: Princeton University Press.

Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2002). Cluster validity methods: Part 1. SIGMOD Record, 31, 40–45. doi:10.1145/565117.565124

Hambleton, R. K. (1980). Test score validity and standard-setting methods. In R. A. Berk (Ed.), Criterion-referenced measurement: The state of the art (pp. 80–123). Baltimore, MD: Johns Hopkins University Press.

Hanlon, C., Medhin, G., Alem, A., Araya, M., Abdulahi, A., Hughes, M., . . . Prince, M. (2008). Detecting perinatal common mental disorders in Ethiopia: Validation of the Self-Reporting Questionnaire and Edinburgh Postnatal Depression Scale. Journal of Affective Disorders, 108, 251–262. doi:10.1016/j.jad.2007.10.023

Hill, H. C., Dean, C., & Gaffney, I. M. (2007). Assessing elemental and structural validity: Data from teachers, non-teachers, and mathematicians. Measurement: Interdisciplinary Research and Perspectives, 5, 81–92. doi:10.1080/15366360701486999

Hoffman, R. G., & Davis, G. L. (1995). Prospective validity study: CPI Work Orientation and Managerial Potential Scales. Educational and Psychological Measurement, 55, 881–890. doi:10.1177/0013164495055005024

Hogan, T. P., & Agnello, J. (2004). An empirical study of reporting practices concerning measurement validity. Educational and Psychological Measurement, 64, 802–812. doi:10.1177/0013164404264120

Holtzman, N. A., & Watson, M. S. (Eds.). (1997). Promoting safe and effective genetic testing in the United States: Final report of the Task Force on Genetic Testing. Retrieved from http://www.genome.gov/10001733

Hopwood, C. J., Baker, K. L., & Morey, L. C. (2008). Extratest validity of selected personality assessment inventory scales and indicators in an inpatient substance abuse setting. Journal of Personality Assessment, 90, 574–577. doi:10.1080/00223890802388533

Hubley, A. M., & Zumbo, B. D. (1996). A dialectic on validity: Where we have been and where we are going. Journal of General Psychology, 123, 207–215. doi:10.1080/00221309.1996.9921273

Jolliffe, D., Farrington, D. P., Hawkins, J. D., Catalano, R. F., Hill, K. G., & Kosterman, R. (2003). Predictive, concurrent, prospective and retrospective validity of self-reported delinquency. Criminal Behaviour and Mental Health, 13, 179–197. doi:10.1002/cbm.541

Jonson, J. L., & Plake, B. S. (1998). A historical comparison of validity standards and validity practices. Educational and Psychological Measurement, 58, 736–753. doi:10.1177/0013164498058005002

Julnes, G. (2011). Reframing validity in research and evaluation: A multidimensional, systematic model of valid inference. In H. T. Chen, S. I. Donaldson, & M. M. Mark (Eds.), Advancing validity in outcome evaluation: Theory and practice (pp. 55–67). Hoboken, NJ: Wiley.

Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319–342. doi:10.1111/j.1745-3984.2001.tb01130.x

Kane, M. (2004). The analysis of interpretive arguments: Some observations inspired by the comments. Measurement: Interdisciplinary Research and Perspectives, 2, 192–200. doi:10.1207/s15366359mea0203_3

Kane, M. (2008). Terminology, emphasis, and utility in validation. Educational Researcher, 37, 76–82. doi:10.3102/0013189X08315390

Kane, M. (2009). Validating the interpretations and uses of test scores. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 39–64). Charlotte, NC: Information Age.

Kane, M. (2012). All validity is construct validity. Or is it? Measurement: Interdisciplinary Research and Perspectives, 10, 66–70. doi:10.1080/15366367.2012.681977

Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73. doi:10.1111/jedm.12000

Karelitz, T. M., Parrish, D. M., Yamada, H., & Wilson, M. (2010). Articulating assessments across childhood: The cross-age validity of the Desired Results Developmental Profile–Revised. Educational Assessment, 15, 1–26. doi:10.1080/10627191003673208

Kvale, S. (1995). The social construction of validity. Qualitative Inquiry, 1, 19–40. doi:10.1177/107780049500100103

Landy, F. L. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183–1192. doi:10.1037/0003-066X.41.11.1183

Larsen, K. R., Nevo, D., & Rich, E. (2008). Exploring the semantic validity of questionnaire scales. In R. H. Sprague, Jr. (Ed.), Proceedings of the 41st Hawaii International Conference on System Sciences [CD]. Washington, DC: IEEE Computer Society. doi:10.1109/HICSS.2008.165

Lather, P. (1986). Issues of validity in openly ideological research: Between a rock and a hard place. Interchange, 17, 63–84. doi:10.1007/BF01807017

Lather, P. (1993). Fertile obsession: Validity after poststructuralism. The Sociological Quarterly, 34, 673–693.

Lawshe, C. H. (1952). Employee selection. Personnel Psychology, 5, 31–34. doi:10.1111/j.1744-6570.1952.tb00990.x

Lawshe, C. H. (1985). Inferences from personnel tests and their validity. Journal of Applied Psychology, 70, 237–238.

Lennon, R. T. (1956). Assumptions underlying the use of content validity. Educational and Psychological Measurement, 16, 294–304. doi:10.1177/001316445601600303

Lievens, F., Buyse, T., & Sackett, P. R. (2005). The operational validity of a video-based situational judgment test for medical college admission: Illustrating the importance of matching predictor and criterion construct domains. Journal of Applied Psychology, 90, 442–452. doi:10.1037/0021-9010.90.3.442

Lindquist, E. F. (1936). The theory of test construction. In H. E. Hawkes, E. F. Lindquist, & C. R. Mann (Eds.), The construction and use of achievement examinations: A manual for secondary school teachers (pp. 17–106). Cambridge, MA: Riverside Press.

Linn, R. L. (1978). Single-group validity, differential validity, and differential prediction. Journal of Applied Psychology, 63, 507–512. doi:10.1037/0021-9010.63.4.507

Linn, R. L. (1997). Evaluating the validity of assessments: The consequences of use. Educational Measurement: Issues and Practice, 16, 14–16. doi:10.1111/j.1745-3992.1997.tb00587.x

Lissitz, R. W., & Samuelsen, K. (2007). A suggested change in terminology and emphasis regarding validity and education. Educational Researcher, 36, 437–448. doi:10.3102/0013189X07311286

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(Suppl. 9), 635–694. doi:10.2466/pr0.1957.3.3.635

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

MacPhail, F. (1998). Moving beyond statistical validity in economics. Social Indicators Research, 45, 119–149. doi:10.1023/A:1006989612799

Maguire, T., Hattie, J., & Haig, B. (1994). Construct validity and achievement assessment. Alberta Journal of Educational Research, 40(2), 109–126.

Maraun, M. D., Slaney, K. L., & Gabriel, S. M. (2009). The Augustinian methodological family of psychology. New Ideas in Psychology, 27, 148–162. doi:10.1016/j.newideapsych.2008.04.011

Marcoulides, G. A. (2004). Conceptual debates in evaluating measurement procedures. Measurement: Interdisciplinary Research and Perspectives, 2, 182–184. doi:10.1207/s15366359mea0203_2

Markus, K. A. (2012). Constructs and attributes in test validity: Reflections on Newton’s account. Measurement: Interdisciplinary Research and Perspectives, 10, 84–87. doi:10.1080/15366367.2012.677348

Markus, M. L., & Robey, D. (1980). The organizational validity of management information systems. Cambridge, MA: Massachusetts Institute of Technology, Center for Information Systems Research.

Maxwell, J. A. (1992). Understanding and validity in qualitative research. Harvard Educational Review, 62, 279–300.

McCrae, R. R. (1982). Consensual validation of personality traits: Evidence from self-reports and ratings. Journal of Personality and Social Psychology, 43, 293–303. doi:10.1037/0022-3514.43.2.293

McIntire, S. A., & Miller, L. A. (2007). Foundations of psychological testing: A practical approach (2nd ed.). Thousand Oaks, CA: Sage.

Mehrens, W. A. (1997). The consequences of consequential validity. Educational Measurement: Issues and Practice, 16, 16–18. doi:10.1111/j.1745-3992.1997.tb00588.x

Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955–966. doi:10.1037/0003-066X.30.10.955

Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027. doi:10.1037/0003-066X.35.11.1012

Messick, S. (1981). Evidence and ethics in the evaluation of tests. Educational Researcher, 10, 9–20. doi:10.3102/0013189X010009009

Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 33–48). Hillsdale, NJ: Erlbaum.

Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Washington, DC: American Council on Education.

Messick, S. (1992). Validity of test interpretation and use. In M. C. Alkin (Ed.), Encyclopedia of educational research (6th ed., Vol. 4, pp. 1487–1495). New York, NY: Macmillan.

Messick, S. (1998). Test validity: A matter of consequences. Social Indicators Research, 45, 35–44. doi:10.1023/A:1006964925094

Miller, M. D., Linn, R. L., & Gronlund, N. E. (2009). Measurement and assessment in teaching (10th ed.). Upper Saddle River, NJ: Pearson Education.

Mosier, C. I. (1947). A critical examination of the concepts of face validity. Educational and Psychological Measurement, 7, 191–205. doi:10.1177/001316444700700201

Moss, P. A. (1995). Themes and variations in validity theory. Educational Measurement: Issues and Practice, 14, 5–13. doi:10.1111/j.1745-3992.1995.tb00854.x

Murphy, K. R. (2009). Content validation is useful for many things, but validity isn’t one of them. Industrial and Organizational Psychology, 2, 453–464. doi:10.1111/j.1754-9434.2009.01173.x

Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22, 287–293. doi:10.1111/j.1745-3984.1985.tb01065.x

Newton, P. E. (2012a). Clarifying the consensus definition of validity. Measurement: Interdisciplinary Research and Perspectives, 10, 1–29. doi:10.1080/15366367.2012.669666

Newton, P. E. (2012b). Questioning the consensus definition of validity. Measurement: Interdisciplinary Research and Perspectives, 10, 110–122. doi:10.1080/15366367.2012.688456

Pollitt, A. (2012). Validity cannot be created, it can only be lost. Measurement: Interdisciplinary Research and Perspectives, 10, 100–103. doi:10.1080/15366367.2012.686868

Popham, W. J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.

Popham, W. J. (1997). Consequential validity: Right concern—wrong concept. Educational Measurement: Issues and Practice, 16, 9–13. doi:10.1111/j.1745-3992.1997.tb00586.x

Reynolds, C. R., Livingston, R. B., & Willson, V. (2010). Measurement and assessment in education (2nd ed.). Upper Saddle River, NJ: Pearson.

Richardson, M. W. (1936). The relation between the difficulty and the differential validity of a test. Psychometrika, 1, 33–49. doi:10.1007/BF02288003

Rosenberg, M. (1979). Conceiving the self. New York, NY: Basic Books.

Rulon, P. J. (1946). On the validity of educational tests. Harvard Educational Review, 16, 290–296.

Scriven, M. (2002). Assessing six assumptions in assessment. In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 255–275). Mahwah, NJ: Erlbaum.

Sechrest, L. (1963). Incremental validity: A recommendation. Educational and Psychological Measurement, 23, 153–158. doi:10.1177/001316446302300113

Shaw, D. J., & Linden, J. D. (1964). A critique of the Hand Test. Educational and Psychological Measurement, 24, 283–284. doi:10.1177/001316446402400209

Shaw, S., & Weir, C. J. (2007). Examining writing: Research and practice in assessing second language writing. Cambridge, England: Cambridge University Press.

Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405–450.

Shepard, L. A. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16, 5–24. doi:10.1111/j.1745-3992.1997.tb00585.x

Sireci, S. G. (1998). The construct of content validity. Social Indicators Research, 45, 83–117. doi:10.1023/A:1006985528729

Sireci, S. G. (2007). On validity theory and test validation. Educational Researcher, 36, 477–481. doi:10.3102/0013189X07311609

Sireci, S. G. (2009). Packing and unpacking sources of validity evidence: History repeats itself again. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 19–37). Charlotte, NC: Information Age.

Smith, G. T. (2005). On construct validity: Issues of method and measurement. Psychological Assessment, 17, 396–408. doi:10.1037/1040-3590.17.4.396

Tenopyr, M. L. (1986). Needed directions for measurement in work settings. In J. V. Mitchell, Jr. (Series Ed.) & B. S. Plake & J. C. Witt (Vol. Eds.), Buros-Nebraska Symposium on Measurement and Testing: Vol. 2. The future of testing (pp. 269–288). Hillsdale, NJ: Erlbaum.

Thurstone, L. L. (1931). The reliability and validity of tests: Derivation and interpretation of fundamental formulae concerned with reliability and validity of tests and illustrative problems. Ann Arbor, MI: Edwards. doi:10.1037/11418-000

Trochim, W. M. (2006). The research methods knowledge base (2nd ed.). Retrieved from http://www.socialresearchmethods.net/kb/

Tryon, R. C. (1957a). Communality of a variable: Formulation by cluster analysis. Psychometrika, 22, 241–260. doi:10.1007/BF02289125

Tryon, R. C. (1957b). Reliability and behavior domain validity: Reformulation and historical critique. Psychological Bulletin, 54, 229–249. doi:10.1037/h0047980

Waluchow, W. J. (2009). Four concepts of validity: Reflections on inclusive and exclusive positivism. In M. D. Adler & K. E. Himma (Eds.), The rule of recognition and the United States Constitution (pp. 123–143). Oxford, England: Oxford University Press. doi:10.1093/acprof:oso/9780195343298.003.0005

Watson, G., & Forlano, G. (1935). Prima facie validity in character tests. Journal of Educational Psychology, 26, 1–16. doi:10.1037/h0057103

Willcutt, E. G., & Carlson, C. L. (2005). The diagnostic validity of attention-deficit/hyperactivity disorder. Clinical Neuroscience Research, 5, 219–232. doi:10.1016/j.cnr.2005.09.003

Wolming, S., & Wikstrom, C. (2010). The concept of validity in theory and practice. Assessment in Education: Principles, Policy and Practice, 17, 117–132.

Woody, C. (1935). A symposium on the effects of measurement on instruction. Journal of Educational Research, 28, 481–483.

Yalow, E., & Popham, W. J. (1983). Content validity at the crossroads. Educational Researcher, 12, 10–21. doi:10.3102/0013189X012008010

Zumbo, B. D. (2009). Validity as contextualized and pragmatic explanation, and its implications for validation practice. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 65–82). Charlotte, NC: Information Age.

Received May 22, 2012
Revision received March 26, 2013
Accepted April 7, 2013
