
Validity of Psychological Assessment: Validation of Inferences From Persons' Responses and Performances as Scientific Inquiry Into Score Meaning

Samuel Messick
Educational Testing Service

The traditional conception of validity divides it into three separate and substitutable types—namely, content, criterion, and construct validities. This view is fragmented and incomplete, especially because it fails to take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of score use. The new unified concept of validity interrelates these issues as fundamental aspects of a more comprehensive theory of construct validity that addresses both score meaning and social values in test interpretation and test use. That is, unified validity integrates considerations of content, criteria, and consequences into a construct framework for the empirical testing of rational hypotheses about score meaning and theoretically relevant relationships, including those of an applied and a scientific nature. Six distinguishable aspects of construct validity are highlighted as a means of addressing central issues implicit in the notion of validity as a unified concept. These are content, substantive, structural, generalizability, external, and consequential aspects of construct validity. In effect, these six aspects function as general validity criteria or standards for all educational and psychological measurement, including performance assessments, which are discussed in some detail because of their increasing emphasis in educational and employment settings.

Validity is an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions on the basis of test scores or other modes of assessment (Messick, 1989b). Validity is not a property of the test or assessment as such, but rather of the meaning of the test scores. These scores are a function not only of the items or stimulus conditions, but also of the persons responding as well as the context of the assessment. In particular, what needs to be valid is the meaning or interpretation of the score, as well as any implications for action that this meaning entails (Cronbach, 1971). The extent to which score meaning and action implications hold across persons or population groups and across settings or contexts is a persistent and perennial empirical question. This is the main reason that validity is an evolving property and validation a continuing process.

The Value of Validity

The principles of validity apply not just to interpretive and action inferences derived from test scores as ordinarily conceived, but also to inferences based on any means of observing or documenting consistent behaviors or attributes. Thus, the term score is used generically in its broadest sense to mean any coding or summarization of observed consistencies or performance regularities on a test, questionnaire, observation procedure, or other assessment devices such as work samples, portfolios, and realistic problem simulations.

This general usage subsumes qualitative as well as quantitative summaries. It applies, for example, to behavior protocols, to clinical appraisals, to computerized verbal score reports, and to behavioral or performance judgments or ratings. Scores in this sense are not limited to behavioral consistencies and attributes of persons (e.g., persistence and verbal ability). Scores may also refer to functional consistencies and attributes of groups, situations or environments, and objects or institutions, as in measures of group solidarity, situational stress, quality of artistic products, and such social indicators as school dropout rate.

Hence, the principles of validity apply to all assessments, including performance assessments. For example, student portfolios are often the source of inferences—not just about the quality of the included products but also about the knowledge, skills, or other attributes of the student—and such inferences about quality and constructs need to meet standards of validity. This is important because performance assessments, although long a staple of industrial and military applications, are now touted as purported instruments of standards-based education reform because they promise positive consequences for teaching and learning.

Editor's note. Samuel M. Turner served as action editor for this article. This article was presented as a keynote address at the Conference on Contemporary Psychological Assessment, Arbetspsykologiska Utvecklingsinstitutet, June 7-8, 1994, Stockholm, Sweden.

Author's note. Acknowledgments are gratefully extended to Isaac Bejar, Randy Bennett, Drew Gitomer, Ann Jungeblut, and Michael Zieky for their reviews of various versions of the manuscript.

Correspondence concerning this article should be addressed to Samuel Messick, Educational Testing Service, Princeton, NJ 08541.



Indeed, it is precisely because of such politically salient potential consequences that the validity of performance assessment needs to be systematically addressed, as do other basic measurement issues such as reliability, comparability, and fairness. The latter reference to fairness broaches a broader set of equity issues in testing that includes fairness of test use, freedom from bias in scoring and interpretation, and the appropriateness of the test-based constructs or rules underlying decision making or resource allocation, that is, distributive justice (Messick, 1989b).

These issues are critical for performance assessment—as they are for all educational and psychological assessment—because validity, reliability, comparability, and fairness are not just measurement principles, they are social values that have meaning and force outside of measurement whenever evaluative judgments and decisions are made. As a salient social value, validity assumes both a scientific and a political role that can by no means be fulfilled by a simple correlation coefficient between test scores and a purported criterion (i.e., classical criterion-related validity) or by expert judgments that test content is relevant to the proposed test use (i.e., traditional content validity).

Indeed, validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual—as well as potential—consequences of score interpretation and use (i.e., construct validity conceived comprehensively). This comprehensive view of validity integrates considerations of content, criteria, and consequences into a construct framework for empirically testing rational hypotheses about score meaning and utility. Therefore, it is fundamental that score validation is an empirical evaluation of the meaning and consequences of measurement. As such, validation combines scientific inquiry with rational argument to justify (or nullify) score interpretation and use.

Comprehensiveness of Construct Validity

In principle as well as in practice, construct validity is based on an integration of any evidence that bears on the interpretation or meaning of the test scores—including content- and criterion-related evidence—which are thus subsumed as part of construct validity. In construct validation the test score is not equated with the construct it attempts to tap, nor is it considered to define the construct, as in strict operationism (Cronbach & Meehl, 1955). Rather, the measure is viewed as just one of an extensible set of indicators of the construct. Convergent empirical relationships reflecting communality among such indicators are taken to imply the operation of the construct to the degree that discriminant evidence discounts the intrusion of alternative constructs as plausible rival hypotheses.

A fundamental feature of construct validity is construct representation, whereby one attempts to identify through cognitive-process analysis or research on personality and motivation the theoretical mechanisms underlying task performance, primarily by decomposing the task into requisite component processes and assembling them into a functional model or process theory (Embretson, 1983). Relying heavily on the cognitive psychology of information processing, construct representation refers to the relative dependence of task responses on the processes, strategies, and knowledge (including metacognitive or self-knowledge) that are implicated in task performance.

Sources of Invalidity

There are two major threats to construct validity: In the one known as construct underrepresentation, the assessment is too narrow and fails to include important dimensions or facets of the construct. In the threat to validity known as construct-irrelevant variance, the assessment is too broad, containing excess reliable variance associated with other distinct constructs as well as method variance such as response sets or guessing propensities that affects responses in a manner irrelevant to the interpreted construct. Both threats are operative in all assessments. Hence a primary validation concern is the extent to which the same assessment might underrepresent the focal construct while simultaneously contaminating the scores with construct-irrelevant variance.

There are two basic kinds of construct-irrelevant variance. In the language of ability and achievement testing, these might be called construct-irrelevant difficulty and construct-irrelevant easiness. In the former, aspects of the task that are extraneous to the focal construct make the task irrelevantly difficult for some individuals or groups. An example is the intrusion of undue reading comprehension requirements in a test of subject matter knowledge. In general, construct-irrelevant difficulty leads to construct scores that are invalidly low for those individuals adversely affected (e.g., knowledge scores of poor readers or examinees with limited English proficiency). Of course, if concern is solely with criterion prediction and the criterion performance requires reading skill as well as subject matter knowledge, then both sources of variance would be considered criterion-relevant and valid. However, for score interpretations in terms of subject matter knowledge and for any score uses based thereon, undue reading requirements would constitute construct-irrelevant difficulty.

Indeed, construct-irrelevant difficulty for individuals and groups is a major source of bias in test scoring and interpretation and of unfairness in test use. Differences in construct-irrelevant difficulty for groups, as distinct from construct-relevant group differences, are the major culprit sought in analyses of differential item functioning (Holland & Wainer, 1993).
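The logic of such analyses can be made concrete with a small sketch. The following computes the Mantel-Haenszel common odds ratio, one standard index of differential item functioning treated in Holland and Wainer (1993), for a single item after matching examinees on total score. It is a minimal illustration under simplifying assumptions, not the full procedure, and the array and function names are hypothetical.

```python
import numpy as np

def mantel_haenszel_odds_ratio(item_correct, group, total_score):
    """Mantel-Haenszel common odds ratio for one studied item.

    item_correct : 0/1 responses to the studied item
    group        : 0 = reference group, 1 = focal group
    total_score  : matching variable (e.g., total test score)

    A ratio near 1.0 is consistent with no DIF; marked departures flag
    group-dependent difficulty after matching on overall proficiency.
    """
    item_correct, group, total_score = map(
        np.asarray, (item_correct, group, total_score))
    num = den = 0.0
    for s in np.unique(total_score):       # stratify by matched score level
        at_s = total_score == s
        ref, foc = at_s & (group == 0), at_s & (group == 1)
        a = np.sum(item_correct[ref] == 1)  # reference, correct
        b = np.sum(item_correct[ref] == 0)  # reference, incorrect
        c = np.sum(item_correct[foc] == 1)  # focal, correct
        d = np.sum(item_correct[foc] == 0)  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den if den > 0 else float("nan")
```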

In contrast, construct-irrelevant easiness occurs when extraneous clues in item or task formats permit some individuals to respond correctly or appropriately in ways irrelevant to the construct being assessed. Another instance occurs when the specific test material, either deliberately or inadvertently, is highly familiar to some respondents, as when the text of a reading comprehension passage is well-known to some readers or the musical score for a sight-reading exercise invokes a well-drilled rendition for some performers. Construct-irrelevant easiness leads to scores that are invalidly high for the affected individuals as reflections of the construct under scrutiny.

The concept of construct-irrelevant variance is important in all educational and psychological measurement, including performance assessments. This is especially true of richly contextualized assessments and so-called "authentic" simulations of real-world tasks. This is the case because "paradoxically, the complexity of context is made manageable by contextual clues" (Wiggins, 1993, p. 208). And it matters whether the contextual clues that people respond to are construct-relevant or represent construct-irrelevant difficulty or easiness.

However, what constitutes construct-irrelevant variance is a tricky and contentious issue (Messick, 1994). This is especially true of performance assessments, which typically invoke constructs that are higher order and complex in the sense of subsuming or organizing multiple processes. For example, skill in communicating mathematical ideas might well be considered irrelevant variance in the assessment of mathematical knowledge (although not necessarily vice versa). But both communication skill and mathematical knowledge are considered relevant parts of the higher-order construct of mathematical power, according to the content standards delineated by the National Council of Teachers of Mathematics (1989). It all depends on how compelling the evidence and arguments are that the particular source of variance is a relevant part of the focal construct, as opposed to affording a plausible rival hypothesis to account for the observed performance regularities and relationships with other variables.

A further complication arises when construct-irrelevant variance is deliberately capitalized upon to produce desired social consequences, as in score adjustments for minority groups, within-group norming, or sliding band procedures (Cascio, Outtz, Zedeck, & Goldstein, 1991; Hartigan & Wigdor, 1989; Schmidt, 1991). However, recognizing that these adjustments distort the meaning of the construct as originally assessed, psychologists should distinguish such controversial procedures in applied testing practice (Gottfredson, 1994; Sackett & Wilk, 1994) from the valid assessment of focal constructs and from any score uses based on that construct meaning. Construct-irrelevant variance is always a source of invalidity in the assessment of construct meaning and its action implications. These issues portend the substantive and consequential aspects of construct validity, which are discussed in more detail later.

Sources of Evidence in Construct Validity

In essence, construct validity comprises the evidence and rationales supporting the trustworthiness of score interpretation in terms of explanatory concepts that account for both test performance and score relationships with other variables. In its simplest terms, construct validity is the evidential basis for score interpretation. As an integration of evidence for score meaning, it applies to any score interpretation—not just those involving so-called "theoretical constructs." Almost any kind of information about a test can contribute to an understanding of score meaning, but the contribution becomes stronger if the degree of fit of the information with the theoretical rationale underlying score interpretation is explicitly evaluated (Cronbach, 1988; Kane, 1992; Messick, 1989b). Historically, primary emphasis in construct validation has been placed on internal and external test structures—that is, on the appraisal of theoretically expected patterns of relationships among item scores or between test scores and other measures.

Probably even more illuminating in regard to score meaning are studies of expected performance differences over time, across groups and settings, and in response to experimental treatments and manipulations. For example, over time one might demonstrate the increased scores from childhood to young adulthood expected for measures of impulse control. Across groups and settings, one might contrast the solution strategies of novices versus experts for measures of domain problem-solving or, for measures of creativity, contrast the creative productions of individuals in self-determined as opposed to directive work environments. With respect to experimental treatments and manipulations, one might seek increased knowledge scores as a function of domain instruction or increased achievement motivation scores as a function of greater benefits and risks. Possibly most illuminating of all, however, are direct probes and modeling of the processes underlying test responses, which are becoming both more accessible and more powerful with continuing developments in cognitive psychology (Frederiksen, Mislevy, & Bejar, 1993; Snow & Lohman, 1989). At the simplest level, this might involve querying respondents about their solution processes or asking them to think aloud while responding to exercises during field trials.

In addition to reliance on these forms of evidence, construct validity, as previously indicated, also subsumes content relevance and representativeness as well as criterion-relatedness. This is the case because such information about the range and limits of content coverage and about specific criterion behaviors predicted by the test scores clearly contributes to score interpretation. In the latter instance, correlations between test scores and criterion measures—viewed within the broader context of other evidence supportive of score meaning—contribute to the joint construct validity of both predictor and criterion. In other words, empirical relationships between predictor scores and criterion measures should make theoretical sense in terms of what the predictor test is interpreted to measure and what the criterion is presumed to embody (Gulliksen, 1950).

An important form of validity evidence still remaining bears on the social consequences of test interpretation and use. It is ironic that validity theory has paid so little attention over the years to the consequential basis of test validity, because validation practice has long invoked such notions as the functional worth of the testing—that is, a concern over how well the test does the job for which it is used (Cureton, 1951; Rulon, 1946). And to appraise how well a test does its job, one must inquire whether the potential and actual social consequences of test interpretation and use are not only supportive of the intended testing purposes, but also at the same time consistent with other social values.

With some trepidation due to the difficulties inherent in forecasting, both potential and actual consequences are included in this formulation for two main reasons: First, anticipation of likely outcomes may guide one where to look for side effects and toward what kinds of evidence are needed to monitor consequences; second, such anticipation may alert one to take timely steps to capitalize on positive effects and to ameliorate or forestall negative effects.

However, this form of evidence should not be viewed in isolation as a separate type of validity, say, of "consequential validity." Rather, because the values served in the intended and unintended outcomes of test interpretation and use both derive from and contribute to the meaning of the test scores, appraisal of the social consequences of the testing is also seen to be subsumed as an aspect of construct validity (Messick, 1964, 1975, 1980). In the language of the Cronbach and Meehl (1955) seminal manifesto on construct validity, the intended consequences of the testing are strands in the construct's nomological network representing presumed action implications of score meaning. The central point is that unintended consequences, when they occur, are also strands in the construct's nomological network that need to be taken into account in construct theory, score interpretation, and test use. At issue is evidence for not only negative but also positive consequences of testing, such as the promised benefits of educational performance assessment for teaching and learning.

A major concern in practice is to distinguish adverse consequences that stem from valid descriptions of individual and group differences from adverse consequences that derive from sources of test invalidity such as construct underrepresentation and construct-irrelevant variance. The latter adverse consequences of test invalidity present measurement problems that need to be investigated in the validation process, whereas the former consequences of valid assessment represent problems of social policy. But more about this later.

Thus, the process of construct validation evolves from these multiple sources of evidence a mosaic of convergent and discriminant findings supportive of score meaning. However, in anticipated applied test use, this mosaic of general evidence may or may not include pertinent specific evidence of (a) the relevance of the test to the particular applied purpose and (b) the utility of the test in the applied setting. Hence, the general construct validity evidence may need to be buttressed in applied instances by specific evidence of relevance and utility.

In summary, the construct validity of score interpretation comes to undergird all score-based inferences—not just those related to interpretive meaningfulness but also the content- and criterion-related inferences specific to applied decisions and actions based on test scores. From the discussion thus far, it should also be clear that test validity cannot rely on any one of the supplementary forms of evidence just discussed. However, neither does validity require any one form, granted that there is defensible convergent and discriminant evidence supporting score meaning. To the extent that some form of evidence cannot be developed—as when criterion-related studies must be forgone because of small sample sizes, unreliable or contaminated criteria, and highly restricted score ranges—heightened emphasis can be placed on other evidence, especially on the construct validity of the predictor tests and on the relevance of the construct to the criterion domain (Guion, 1976; Messick, 1989b). What is required is a compelling argument that the available evidence justifies the test interpretation and use, even though some pertinent evidence had to be forgone. Hence, validity becomes a unified concept, and the unifying force is the meaningfulness or trustworthy interpretability of the test scores and their action implications, namely, construct validity.

Aspects of Construct Validity

However, to speak of validity as a unified concept does not imply that validity cannot be usefully differentiated into distinct aspects to underscore issues and nuances that might otherwise be downplayed or overlooked, such as the social consequences of performance assessments or the role of score meaning in applied use. The intent of these distinctions is to provide a means of addressing functional aspects of validity that help disentangle some of the complexities inherent in appraising the appropriateness, meaningfulness, and usefulness of score inferences.

In particular, six distinguishable aspects of construct validity are highlighted as a means of addressing central issues implicit in the notion of validity as a unified concept. These are content, substantive, structural, generalizability, external, and consequential aspects of construct validity. In effect, these six aspects function as general validity criteria or standards for all educational and psychological measurement (Messick, 1989b). Following a capsule description of these six aspects, some of the validity issues and sources of evidence bearing on each are highlighted:

• The content aspect of construct validity includes evidence of content relevance, representativeness, and technical quality (Lennon, 1956; Messick, 1989b);

• The substantive aspect refers to theoretical rationales for the observed consistencies in test responses, including process models of task performance (Embretson, 1983), along with empirical evidence that the theoretical processes are actually engaged by respondents in the assessment tasks;

• The structural aspect appraises the fidelity of the scoring structure to the structure of the construct domain at issue (Loevinger, 1957; Messick, 1989b);

• The generalizability aspect examines the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks (Cook & Campbell, 1979; Shulman, 1970), including validity generalization of test-criterion relationships (Hunter, Schmidt, & Jackson, 1982);

• The external aspect includes convergent and discriminant evidence from multitrait-multimethod comparisons (Campbell & Fiske, 1959), as well as evidence of criterion relevance and applied utility (Cronbach & Gleser, 1965);

• The consequential aspect appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to sources of invalidity related to issues of bias, fairness, and distributive justice (Messick, 1980, 1989b).

Content Relevance and Representativeness

A key issue for the content aspect of construct validity is the specification of the boundaries of the construct domain to be assessed—that is, determining the knowledge, skills, attitudes, motives, and other attributes to be revealed by the assessment tasks. The boundaries and structure of the construct domain can be addressed by means of job analysis, task analysis, curriculum analysis, and especially domain theory, in other words, scientific inquiry into the nature of the domain processes and the ways in which they combine to produce effects or outcomes. A major goal of domain theory is to understand the construct-relevant sources of task difficulty, which then serves as a guide to the rational development and scoring of performance tasks and other assessment formats. At whatever stage of its development, then, domain theory is a primary basis for specifying the boundaries and structure of the construct to be assessed.

However, it is not sufficient merely to select tasks that are relevant to the construct domain. In addition, the assessment should assemble tasks that are representative of the domain in some sense. The intent is to insure that all important parts of the construct domain are covered, which is usually described as selecting tasks that sample domain processes in terms of their functional importance, or what Brunswik (1956) called ecological sampling. Functional importance can be considered in terms of what people actually do in the performance domain, as in job analyses, but also in terms of what characterizes and differentiates expertise in the domain, which would usually emphasize different tasks and processes. Both the content relevance and representativeness of assessment tasks are traditionally appraised by expert professional judgment, documentation of which serves to address the content aspect of construct validity.

Substantive Theories, Process Models, and Process Engagement

The substantive aspect of construct validity emphasizes the role of substantive theories and process modeling in identifying the domain processes to be revealed in assessment tasks (Embretson, 1983; Messick, 1989b). Two important points are involved: One is the need for tasks providing appropriate sampling of domain processes in addition to traditional coverage of domain content; the other is the need to move beyond traditional professional judgment of content to accrue empirical evidence that the ostensibly sampled processes are actually engaged by respondents in task performance.

Thus, the substantive aspect adds to the content aspect of construct validity the need for empirical evidence of response consistencies or performance regularities reflective of domain processes (Loevinger, 1957). Such evidence may derive from a variety of sources, for example, from "think aloud" protocols or eye movement records during task performance; from correlation patterns among part scores; from consistencies in response times for task segments; or from mathematical or computer modeling of task processes (Messick, 1989b, pp. 53-55; Snow & Lohman, 1989). In summary, the issue of domain coverage refers not just to the content representativeness of the construct measure but also to the process representation of the construct and the degree to which these processes are reflected in construct measurement.

The core concept bridging the content and substantive aspects of construct validity is representativeness. This becomes clear once one recognizes that the term representative has two distinct meanings, both of which are applicable to performance assessment. One is in the cognitive psychologist's sense of representation or modeling (Suppes, Pavel, & Falmagne, 1994); the other is in the Brunswikian sense of ecological sampling (Brunswik, 1956; Snow, 1974). The choice of tasks or contexts in assessment is a representative sampling issue. The comprehensiveness and fidelity of simulating the construct's realistic engagement in performance is a representation issue. Both issues are important in educational and psychological measurement and especially in performance assessment.

Scoring Models As Reflective of Task and Domain Structure

According to the structural aspect of construct validity, scoring models should be rationally consistent with what is known about the structural relations inherent in behavioral manifestations of the construct in question (Loevinger, 1957; Peak, 1953). That is, the theory of the construct domain should guide not only the selection or construction of relevant assessment tasks but also the rational development of construct-based scoring criteria and rubrics.

Ideally, the manner in which behavioral instances are combined to produce a score should rest on knowledge of how the processes underlying those behaviors combine dynamically to produce effects. Thus, the internal structure of the assessment (i.e., interrelations among the scored aspects of task and subtask performance) should be consistent with what is known about the internal structure of the construct domain (Messick, 1989b). This property of construct-based rational scoring models is called structural fidelity (Loevinger, 1957).

Generalizability and the Boundaries of Score Meaning

The concern that a performance assessment should provide representative coverage of the content and processes of the construct domain is meant to insure that the score interpretation not be limited to the sample of assessed tasks but be broadly generalizable to the construct domain. Evidence of such generalizability depends on the degree of correlation of the assessed tasks with other tasks representing the construct or aspects of the construct. This issue of generalizability of score inferences across tasks and contexts goes to the very heart of score meaning. Indeed, setting the boundaries of score meaning is precisely what generalizability evidence is meant to address.

However, because of the extensive time required for the typical performance task, there is a conflict in performance assessment between time-intensive depth of examination and the breadth of domain coverage needed for generalizability of construct interpretation. This conflict between depth and breadth of coverage is often viewed as entailing a trade-off between validity and reliability (or generalizability). It might better be depicted as a trade-off between the valid description of the specifics of a complex task and the power of construct interpretation. In any event, such a conflict signals a design problem that needs to be carefully negotiated in performance assessment (Wiggins, 1993).

In addition to generalizability across tasks, the limits of score meaning are also affected by the degree of generalizability across time or occasions and across observers or raters of the task performance. Such sources of measurement error associated with the sampling of tasks, occasions, and scorers underlie traditional reliability concerns (Feldt & Brennan, 1989).
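Such sampling concerns are commonly quantified with generalizability theory, which partitions observed-score variance into components for persons and for the sampled facets (cf. Feldt & Brennan, 1989). The sketch below is a simplified illustration rather than a full generalizability analysis: it estimates variance components for a hypothetical persons-by-tasks crossed design from the usual ANOVA mean squares, and the function name is illustrative.

```python
import numpy as np

def one_facet_g_study(scores):
    """Persons x tasks generalizability (one-facet crossed design).

    scores : 2-D array, rows = persons, columns = tasks.
    Returns estimated variance components and the generalizability
    coefficient for a mean over the observed number of tasks.
    """
    n_p, n_t = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    task_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA without replication
    ms_p = n_t * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_t = n_p * np.sum((task_means - grand) ** 2) / (n_t - 1)
    resid = scores - person_means[:, None] - task_means[None, :] + grand
    ms_pt = np.sum(resid ** 2) / ((n_p - 1) * (n_t - 1))

    # Expected-mean-square solutions for the variance components
    var_pt = ms_pt                   # person-by-task interaction (plus error)
    var_p = (ms_p - ms_pt) / n_t     # universe-score (person) variance
    var_t = (ms_t - ms_pt) / n_p     # task-difficulty variance

    # Relative generalizability coefficient for a mean over n_t tasks:
    # higher values indicate scores that generalize across the task sample.
    g_coef = var_p / (var_p + var_pt / n_t)
    return var_p, var_t, var_pt, g_coef
```

Designs that also sample occasions or raters add further crossed facets, but the logic is the same: the smaller the facet components relative to person variance, the broader the boundaries of score meaning.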

Convergent and Discriminant Correlations With External Variables

The external aspect of construct validity refers to the extent to which the assessment scores' relationships with other measures and nonassessment behaviors reflect the expected high, low, and interactive relations implicit in the theory of the construct being assessed. Thus, the meaning of the scores is substantiated externally by appraising the degree to which empirical relationships with other measures—or the lack thereof—are consistent with that meaning. That is, the constructs represented in the assessment should rationally account for the external pattern of correlations. Both convergent and discriminant correlation patterns are important, the convergent pattern indicating a correspondence between measures of the same construct and the discriminant pattern indicating a distinctness from measures of other constructs (Campbell & Fiske, 1959). Discriminant evidence is particularly critical for discounting plausible rival alternatives to the focal construct interpretation. Both convergent and discriminant evidence are basic to construct validation.
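This convergent-discriminant logic can be sketched in miniature. In a multitrait-multimethod layout (Campbell & Fiske, 1959), correlations between different methods measuring the same trait should clearly exceed correlations between different traits. The traits, methods, and correlation values below are invented solely for illustration.

```python
import numpy as np

# Hypothetical multitrait-multimethod layout: two traits, each measured
# by two methods (e.g., self-report and observer rating).
labels = [("persistence", "self"), ("verbal", "self"),
          ("persistence", "observer"), ("verbal", "observer")]

def mtmm_summary(r):
    """Average convergent (same trait, different method) and
    discriminant (different trait) correlations from matrix r."""
    conv, disc = [], []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            same_trait = labels[i][0] == labels[j][0]
            (conv if same_trait else disc).append(r[i, j])
    return np.mean(conv), np.mean(disc)

# Illustrative correlation matrix in the order of `labels`.
r = np.array([[1.00, 0.30, 0.55, 0.20],
              [0.30, 1.00, 0.25, 0.60],
              [0.55, 0.25, 1.00, 0.35],
              [0.20, 0.60, 0.35, 1.00]])

# Convergent mean (0.575) exceeds discriminant mean (0.275), the pattern
# expected if the focal constructs, not method factors, drive the scores.
print(mtmm_summary(r))
```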

Of special importance among these external relationships are those between the assessment scores and criterion measures pertinent to selection, placement, licensure, program evaluation, or other accountability purposes in applied settings. Once again, the construct theory points to the relevance of potential relationships between the assessment scores and criterion measures, and empirical evidence of such links attests to the utility of the scores for the applied purpose.

Consequences As Validity Evidence

The consequential aspect of construct validity includes evidence and rationales for evaluating the intended and unintended consequences of score interpretation and use in both the short- and long-term. Social consequences of testing may be either positive, such as improved educational policies based on international comparisons of student performance, or negative, especially when associated with bias in scoring and interpretation or with unfairness in test use. For example, because performance assessments in education promise potential benefits for teaching and learning, it is important to accrue evidence of such positive consequences as well as evidence that adverse consequences are minimal.

The primary measurement concern with respect to adverse consequences is that any negative impact on individuals or groups should not derive from any source of test invalidity, such as construct underrepresentation or construct-irrelevant variance (Messick, 1989b). In other words, low scores should not occur because the assessment is missing something relevant to the focal construct that, if present, would have permitted the affected persons to display their competence. Moreover, low scores should not occur because the measurement contains something irrelevant that interferes with the affected persons' demonstration of competence.

Validity as Integrative Summary

These six aspects of construct validity apply to all educational and psychological measurement, including performance assessments. Taken together, they provide a way of addressing the multiple and interrelated validity questions that need to be answered to justify score interpretation and use. In previous writings, I maintained that it is "the relation between the evidence and the inferences drawn that should determine the validation focus" (Messick, 1989b, p. 16). This relation is embodied in theoretical rationales or persuasive arguments that the obtained evidence both supports the preferred inferences and undercuts plausible rival inferences. From this perspective, as Cronbach (1988) concluded, validation is evaluation argument. That is, as stipulated earlier, validation is empirical evaluation of the meaning and consequences of measurement. The term empirical evaluation is meant to convey that the validation process is scientific as well as rhetorical and requires both evidence and argument.

By focusing on the argument or rationale used to support the assumptions and inferences invoked in the score-based interpretations and actions of a particular test use, one can prioritize the forms of validity evidence needed according to the points in the argument requiring justification or support (Kane, 1992; Shepard, 1993). Helpful as this may be, there still remain problems in setting priorities for needed evidence because the argument may be incomplete or off target, not all the assumptions may be addressed, and the need to discount alternative arguments evokes multiple priorities. This is one reason that Cronbach (1989) stressed cross-argument criteria for assigning priority to a line of inquiry, such as the degree of prior uncertainty, information yield, cost, and leverage in achieving consensus.

Kane (1992) illustrated the argument-based approach by prioritizing the evidence needed to validate a placement test for assigning students to a course in either remedial algebra or calculus. He addressed seven assumptions that, from the present perspective, bear on the content, substantive, generalizability, external, and consequential aspects of construct validity. Yet the structural aspect is not explicitly addressed. Hence, the compensatory property of the usual cumulative total score, which permits good performance on some algebra skills to compensate for poor performance on others, remains unevaluated, in contrast, for example, to scoring models with multiple cut scores or with minimal requirements across the profile of prerequisite skills. The question is whether such profile scoring models might yield not only useful information for diagnosis and remediation but also better student placement.
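The structural point can be made concrete: a compensatory total score and a conjunctive model with multiple cut scores can reach different placement decisions from the very same profile of prerequisite skills. The skill names and cut values in this sketch are hypothetical.

```python
# Hypothetical subscores on prerequisite algebra skills (0-10 scale).
skills = {"linear_equations": 9, "factoring": 3, "exponents": 8}

# Compensatory model: one cumulative total, so strength in some skills
# offsets weakness in others.
compensatory_pass = sum(skills.values()) >= 18

# Conjunctive model: a minimal requirement on every skill in the profile,
# so no single weakness can be masked by the others.
conjunctive_pass = all(score >= 5 for score in skills.values())

print(compensatory_pass, conjunctive_pass)  # True, False for this profile
```

The same examinee is placed into calculus under the compensatory rule but flagged for remediation in factoring under the conjunctive rule, which is precisely why the structural aspect of the scoring model deserves explicit evaluation.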

The structural aspect of construct validity also received little attention in Shepard's (1993) argument-based analysis of the validity of special education placement decisions. This was despite the fact that the assessment referral system under consideration involved a profile of cognitive, biomedical, behavioral, and academic skills that required some kind of structural model linking test results to placement decisions. However, in her analysis of selection uses of the General Aptitude Test Battery (GATB), Shepard (1993) did underscore the structural aspect because the GATB within-group scoring model is both salient and controversial.

The six aspects of construct validity afford a means of checking that the theoretical rationale or persuasive argument linking the evidence to the inferences drawn touches the important bases; if the bases are not covered, an argument that such omissions are defensible must be provided. These six aspects are highlighted because most score-based interpretations and action inferences, as well as the elaborated rationales or arguments that attempt to legitimize them (Kane, 1992), either invoke these properties or assume them, explicitly or tacitly.

In other words, most score interpretations refer to relevant content and operative processes, presumed to be reflected in scores that concatenate responses in domain-appropriate ways and are generalizable across a range of tasks, settings, and occasions. Furthermore, score-based interpretations and actions are typically extrapolated beyond the test context on the basis of presumed relationships with nontest behaviors and anticipated outcomes or consequences. The challenge in test validation is to link these inferences to convergent evidence supporting them and to discriminant evidence discounting plausible rival inferences. Evidence pertinent to all of these aspects needs to be integrated into an overall validity judgment to sustain score inferences and their action implications, or else compelling reasons must be provided why there is not a link, which is what is meant by validity as a unified concept.

Meaning and Values in Test Validation

The essence of unified validity is that the appropriateness, meaningfulness, and usefulness of score-based inferences are inseparable and that the integrating power derives from empirically grounded score interpretation. As seen in this article, both meaning and values are integral to the concept of validity, and psychologists need a way of addressing both concerns in validation practice. In particular, what is needed is a way of configuring validity evidence that forestalls undue reliance on selected forms of evidence as opposed to a pattern of supplementary evidence, and that highlights the important yet subsidiary role of specific content- and criterion-related evidence in support of construct validity in testing applications. Such a means should also formally bring consideration of value implications and social consequences into the validity framework.

A unified validity framework meeting these requirements distinguishes two interconnected facets of validity as a unitary concept (Messick, 1989a, 1989b). One facet is the source of justification of the testing, based on appraisal of either evidence supportive of score meaning or consequences contributing to score valuation. The other facet is the function or outcome of the testing—either interpretation or applied use. If the facet for justification (i.e., either an evidential basis for meaning implications or a consequential basis for value implications of scores) is crossed with the facet for function or outcome (i.e., either test interpretation or test use), a four-fold classification is obtained, highlighting both meaning and values in both test interpretation and test use, as represented by the row and column headings of Figure 1.

Figure 1. Facets of Validity as a Progressive Matrix

                        Test Interpretation             Test Use
  Evidential Basis      Construct Validity (CV)         CV + Relevance/Utility (R/U)
  Consequential Basis   CV + Value Implications (VI)    CV + R/U + VI + Social Consequences

These distinctions may seem fuzzy because they are not only interlinked but overlapping. For example, social consequences of testing are a form of evidence, and other forms of evidence have consequences. Furthermore, to interpret a test is to use it, and all other test uses involve interpretation either explicitly or tacitly. Moreover, utility is both validity evidence and a value consequence. This conceptual messiness derives from cutting through what indeed is a unitary concept to provide a means of discussing its functional aspects.

Each of the cells in this four-fold crosscutting of unified validity is briefly considered in turn, beginning with the evidential basis of test interpretation. Because the evidence and rationales supporting the trustworthiness of score meaning are what is meant by construct validity, the evidential basis of test interpretation is clearly construct validity. The evidential basis of test use is also construct validity, but with the important proviso that the general evidence supportive of score meaning either already includes or becomes enhanced by specific evidence for the relevance of the scores to the applied purpose and for the utility of the scores in the applied setting, where utility is broadly conceived to reflect the benefits of testing relative to its costs (Cronbach & Gleser, 1965).

The consequential basis of test interpretation is the appraisal of value implications of score meaning, including the often tacit value implications of the construct label itself, of the broader theory conceptualizing construct properties and relationships that undergirds construct meaning, and of the still broader ideologies that give theories their perspective and purpose—for example, ideologies about the functions of science or about the nature of the human being as a learner or as an adaptive or fully functioning person. The value implications of score interpretation are not only part of score meaning, but a socially relevant part that often triggers score-based actions and serves to link the construct measured to questions of applied practice and social policy. One way to protect against the tyranny of unexposed and unexamined values in score interpretation is to explicitly adopt multiple value perspectives to formulate and empirically appraise plausible rival hypotheses (Churchman, 1971; Messick, 1989b).

Many constructs such as competence, creativity, intelligence, or extraversion have manifold and arguable value implications that may or may not be sustainable in terms of properties of their associated measures. A central issue is whether the theoretical or trait implications and the value implications of the test interpretation are commensurate, because value implications are not ancillary but, rather, integral to score meaning. Therefore, to make clear that score interpretation is needed to appraise value implications and vice versa, this cell for the consequential basis of test interpretation needs to comprehend both the construct validity as well as the value ramifications of score meaning.

Finally, the consequential basis of test use is the appraisal of both potential and actual social consequences of the applied testing. One approach to appraising potential side effects is to pit the benefits and risks of the proposed test use against the pros and cons of alternatives or counterproposals. By taking multiple perspectives on proposed test use, the various (and sometimes conflicting) value commitments of each proposal are often exposed to open examination and debate (Churchman, 1971; Messick, 1989b). Counterproposals to a proposed test use might involve quite different assessment techniques, such as observations or portfolios when educational performance standards are at issue. Counterproposals might attempt to serve the intended purpose in a different way, such as through training rather than selection when productivity levels are at issue (granted that testing may also be used to reduce training costs, and that failure in training yields a form of selection).

What matters is not only whether the social consequences of test interpretation and use are positive or negative, but how the consequences came about and what determined them. In particular, it is not that adverse social consequences of test use render the use invalid but, rather, that adverse social consequences should not be attributable to any source of test invalidity, such as construct underrepresentation or construct-irrelevant variance. And once again, in recognition of the fact that the weighing of social consequences both presumes and contributes to evidence of score meaning, of relevance, of utility, and of values, this cell needs to include construct validity, relevance, and utility, as well as social and value consequences.

Some measurement specialists argue that adding value implications and social consequences to the validity framework unduly burdens the concept. However, it is simply not the case that values are being added to validity in this unified view. Rather, values are intrinsic to the meaning and outcomes of the testing and have always been. As opposed to adding values to validity as an adjunct or supplement, the unified view instead exposes the inherent value aspects of score meaning and outcome to open examination and debate as an integral part of the validation process (Messick, 1989a). This makes explicit what has been latent all along, namely, that validity judgments are value judgments.

A salient feature of Figure 1 is that construct validity appears in every cell, which is fitting because the construct validity of score meaning is the integrating force that unifies validity issues into a unitary concept. At the same time, by distinguishing facets reflecting the justification and function of the testing, it becomes clear that distinct features of construct validity need to be emphasized, in addition to the general mosaic of evidence, as one moves from the focal issue of one cell to that of the others. In particular, the forms of evidence change and compound as one moves from appraisal of evidence for the construct interpretation per se, to appraisal of evidence supportive of a rational basis for test use, to appraisal of the value consequences of score interpretation as a basis for action, and finally, to appraisal of the social consequences—or, more generally, of the functional worth—of test use.

As different foci of emphasis are highlighted in addressing the basic construct validity appearing in each cell, this movement makes what at first glance was a simple four-fold classification appear more like a progressive matrix, as portrayed in the cells of Figure 1. From one perspective, each cell represents construct validity, with different features highlighted on the basis of the justification and function of the testing. From another perspective, the entire progressive matrix represents construct validity, which is another way of saying that validity is a unified concept. One implication of this progressive-matrix formulation is that both meaning and values, as well as both test interpretation and test use, are intertwined in the validation process. Thus, validity and values are one imperative, not two, and test validation implicates both the science and the ethics of assessment, which is why validity has force as a social value.

REFERENCES

Brunswik, E. (1956). Perception and the representative design of psychological experiments (2nd ed.). Berkeley: University of California Press.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1991). Statistical implications of six methods of test score use in personnel selection. Human Performance, 4, 233-264.
Churchman, C. W. (1971). The design of inquiring systems: Basic concepts of systems and organization. New York: Basic Books.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147-171). Urbana: University of Illinois Press.
Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana: University of Illinois Press.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (1st ed., pp. 621-694). Washington, DC: American Council on Education.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: Macmillan.
Frederiksen, N., Mislevy, R. J., & Bejar, I. (Eds.). (1993). Test theory for a new generation of tests. Hillsdale, NJ: Erlbaum.
Gottfredson, L. S. (1994). The science and politics of race-norming. American Psychologist, 49, 955-963.
Guion, R. M. (1976). Recruiting, selection, and job placement. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 777-828). Chicago: Rand McNally.
Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5, 511-517.
Hartigan, J. A., & Wigdor, A. K. (Eds.). (1989). Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Advanced meta-analysis: Quantitative methods of cumulating research findings across studies. San Francisco: Sage.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.
Lennon, R. T. (1956). Assumptions underlying the use of content validity. Educational and Psychological Measurement, 16, 294-304.
Loevinger, J. (1957). Objective tests as instruments of psychological theory [Monograph]. Psychological Reports, 3, 635-694 (Pt. 9).
Messick, S. (1964). Personality measurement and college performance. In Proceedings of the 1963 Invitational Conference on Testing Problems (pp. 110-129). Princeton, NJ: Educational Testing Service.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1989a). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.
Messick, S. (1989b). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: Author.
Peak, H. (1953). Problems of observation. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences (pp. 243-299). Hinsdale, IL: Dryden Press.
Rulon, P. J. (1946). On the validity of educational tests. Harvard Educational Review, 16, 290-296.
Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49, 929-954.
Schmidt, F. L. (1991). Why all banding procedures are logically flawed. Human Performance, 4, 265-278.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405-450.
Shulman, L. S. (1970). Reconstruction of educational research. Review of Educational Research, 40, 371-396.
Snow, R. E. (1974). Representative and quasi-representative designs for research on teaching. Review of Educational Research, 44, 265-291.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 263-331). New York: Macmillan.
Suppes, P., Pavel, M., & Falmagne, J.-C. (1994). Representations and models in psychology. Annual Review of Psychology, 45, 517-544.
Wiggins, G. (1993). Assessment: Authenticity, context, and validity. Phi Delta Kappan, 75, 200-214.
