


The Criterion Problem: What Measure of Success in Graduate Education?

Rodney T. Hartnett and Warren W. Willingham
Educational Testing Service

A wide variety of potential indicators of graduate student performance are reviewed. Based on a scrutiny of relevant research literature and experience with recent and current research projects, the various indicators are considered in two ways. First, they are analyzed within the framework of the traditional "criterion problem," that is, with respect to their adequacy as criteria in predicting graduate school performance. In this case, emphasis is given to problems with the criteria that make it difficult to draw valid inferences about the relationship between selection measures and performance measures. Second, the various indicators are considered as an important process of the graduate program. In this case, attention is given to their adequacy as procedures for the evaluation of student performance, e.g., their clarity, fairness, and usefulness as feedback to students.

In any educational program a primary question is how to define successful performance. The so-called "criterion problem" has always been an important issue in validating admissions measures; what constitutes success also has a critical bearing on the very conception of a program and its objectives. Nonetheless, there is limited literature on the problem as it applies to graduate study. In fact, Hirschberg and Itkin (1978) recently asserted, "... there has been practically no attempt whatsoever at a thorough theoretical criterion analysis of graduate school success" (p. 1085).

APPLIED PSYCHOLOGICAL MEASUREMENT, Vol. 4, No. 3, Summer 1980, pp. 281-291. Copyright 1980 West Publishing Co.

Notions of what constitutes successful graduate student performance and how it ought to be measured naturally vary widely across institutions, disciplines, and types of programs. As a result, there is often ambiguity in the meaning of "success" in graduate school, and a corresponding set of issues and questions that must be addressed when embarking on research (especially validity studies) that relies heavily on graduate school performance as a criterion. An overview of the criterion problem as it applies to graduate education therefore seems much overdue. This review distinguishes three broad classes of criterion measures: traditional criteria (e.g., grades, examination performance), evidence of professional accomplishment (e.g., publications, awards), and specially developed criteria (e.g., work samples, ratings).

Traditional Criteria

A number of criteria have been used for years in the assessment of graduate student performance. These criteria include such indices of student competence as grades, performance on qualifying and/or comprehensive examinations, degree status (progress toward the degree, whether one eventually earns the degree), and dissertation quality.

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/


Grades

When people speak of success in graduate school or "the criterion" of successful graduate student performance, more often than not they are referring to grades in one form or another. Along with the criterion of degree attainment, grades have been used more than any other criterion in studies of graduate school success or validity of the Graduate Record Examinations (GRE; Willingham, 1974).

As an indicator of student performance, grades have several positive qualities. First, they are usually readily available for virtually all students and therefore make a very convenient criterion. In fact, in a recent validity studies project carried out in cooperation with more than 30 graduate schools, Wilson (1978) reports that the first-year grade-point average is the only criterion that is common to all institutions. In addition, grade-point averages seem to represent a good composite of whatever kinds of academic performance are reflected in grades, since variation in student performance across a large number of courses can be accounted for fairly well by one general achievement factor (Boldt, 1970; French, 1951). Further evidence that it is reasonable to treat grades as representing a single general kind of academic performance is available from studies at the undergraduate level (e.g., Clark, 1964; Barritt, 1966). Thus, even though there is both empirical and anecdotal evidence that different teachers weight student qualities differently when assessing student performance (qualities, for example, such as student effort, amount of improvement during the term, clarity of expression, and level of curiosity), it nevertheless appears that a large part of the information in grade averages can be explained by some unidimensional concept.

Another advantage often claimed for grades is their stability or consistency; that is, students who earn high grades during the first term are more likely to earn higher grades during later terms. This is definitely true at the undergraduate level, though perhaps not so dramatically as many observers might think; although the similarity in academic performance between back-to-back academic terms is fairly high (with correlations between adjacent-term grades often running in the .60s and .70s), grades over an extended period of time are much less stable (Humphreys, 1968; Juola, 1964). At the graduate level, evidence regarding the stability of grades is more difficult to find. It is clear that there is less fluctuation in grades at this point simply because almost all students receive A's and B's, but such consistency does not necessarily imply reliable measurement.
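The distinction between consistency and reliable measurement can be illustrated with a small simulation. All quantities below (the ability scale, noise levels, and grade cutoffs) are invented for illustration and are not taken from any study cited here; the point is only that when nearly every student receives an A or a B, grades can agree strongly from term to term while carrying much less information about underlying ability than an uncompressed score would.

```python
import random
import statistics

random.seed(1)

def corr(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    n = len(x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * sx * sy)

def grade(perf):
    # Hypothetical compressed graduate scale: nearly everyone gets an A or a B.
    return 4.0 if perf > -1.3 else 3.0

ability = [random.gauss(0, 1) for _ in range(5000)]
term1 = [grade(a + random.gauss(0, 0.7)) for a in ability]
term2 = [grade(a + random.gauss(0, 0.7)) for a in ability]
fine = [a + random.gauss(0, 0.7) for a in ability]  # an uncompressed score

agreement = sum(g1 == g2 for g1, g2 in zip(term1, term2)) / len(term1)
print(f"term-to-term grade agreement: {agreement:.2f}")
print(f"compressed grades vs. ability: r = {corr(term1, ability):.2f}")
print(f"uncompressed score vs. ability: r = {corr(fine, ability):.2f}")
```

The compressed grades agree from term to term far more often than not, yet their correlation with the underlying ability variable is well below that of the uncompressed score, which is the sense in which high consistency need not imply reliable measurement.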

Difficulties with grades as a criterion in assessing student performance are numerous. One technical difficulty is that the narrow range of grades assigned attenuates the magnitude of validity coefficients when grades are employed as the criterion in prediction studies. More importantly, the restricted range means that grade differences among students do not fully represent the range of differences in student accomplishment. Grades at the graduate level thus may not provide meaningful descriptions of differential student performance.

A second shortcoming of grades is the obvious fact that grading standards can and do vary dramatically, and sometimes arbitrarily, across disciplines and within disciplines across different institutions (Bowers, 1967; Goldman & Slaughter, 1976; Juola, 1968). As a result, grades are practically useless as a criterion for multi-institutional comparative studies of student performance. Additionally, differing grading standards mean that special statistical techniques are necessary (Wilson, 1978) in order to combine data across institutions (within the same discipline) for validity studies, a strategy that is sometimes desirable owing to the small number of students within one department. Pooled data that do not adjust for such scale differences can sometimes yield an overall negative relationship between the predictor and the criterion, even when the "true" relationship, as revealed in the various single-department (nonpooled) analyses, is positive.

A third difficulty with grades as a criterion is

that it is not always clear what grades mean. Different professors value different types of achievement. In spite of the finding cited earlier that course grades can be accounted for by a fairly general achievement factor (Boldt, 1970), it is at the same time true that grade assignment is sometimes unduly influenced by student characteristics that bear no clear relationship to academic performance, such as gregariousness (Singer, 1964), gender (Caldwell & Hartnett, 1967), or various manipulative strategies (Sanford, 1976). Furthermore, first-year grades in graduate school have been found to be only slightly related to eventual success in doctoral work in psychology (Hackman, Wiggins, & Bass, 1970), and it is likely that the basis for grading is quite different before and after students are accepted to formal candidacy.
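The pooling hazard noted earlier in this section, in which grades combined across departments with different grading standards show a negative predictor-criterion relationship even though each department's own relationship is positive, can be sketched in a few lines. The two departments, their score means, and the grading standards below are entirely hypothetical:

```python
import random
import statistics

random.seed(2)

def corr(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    n = len(x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * sx * sy)

def department(n, score_mean, grade_mean):
    """Simulate one department in which higher test scores earn higher grades."""
    scores, grades = [], []
    for _ in range(n):
        s = random.gauss(score_mean, 50)
        g = grade_mean + 0.004 * (s - score_mean) + random.gauss(0, 0.15)
        scores.append(s)
        grades.append(g)
    return scores, grades

# A lenient-grading department whose students have lower test scores ...
s1, g1 = department(200, 500, 3.8)
# ... and a strict-grading department whose students score higher.
s2, g2 = department(200, 650, 3.2)

print(f"department 1: r = {corr(s1, g1):+.2f}")   # positive
print(f"department 2: r = {corr(s2, g2):+.2f}")   # positive
print(f"pooled:       r = {corr(s1 + s2, g1 + g2):+.2f}")  # negative
```

Within each department the predictor-criterion correlation is strongly positive, but because the stricter department enrolls the higher-scoring students, naive pooling reverses the sign, which is exactly why adjustment techniques of the kind Wilson (1978) describes are needed before combining departments.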

Degree Attainment

Degree attainment has been employed in validity studies as often as the grade-point average (Willingham, 1974) and is a useful and important criterion of graduate student performance. It is generally regarded as the single most important criterion of success by many, if not most, observers. Those who take this position argue that all other administrative criteria, such as grades or faculty ratings, are simply poor proxies for what really counts: namely, whether the student eventually earned his or her degree. Graduate students clearly regard it as the most important outcome of their graduate studies.

As a criterion for the study of graduate student performance, however, degree attainment has certain limitations. One limitation is that students drop out of graduate school for a host of reasons, many of which have little or nothing to do with competence or academic ability. Research indicates that graduate students frequently withdraw for reasons having to do with emotional problems (Halleck, 1976); poor relations with their faculty advisor (Heiss, 1970); family, health, or financial problems (Tucker, Gottlieb, & Pease, 1964); and so on. Worse yet, the real reason for withdrawing may never be learned. As Berelson (1960) has pointed out, there is sometimes (how frequently we cannot say) a discrepancy between the real reason and the reason reported by those withdrawing. According to Berelson, "What is critical frankness between doctoral candidates becomes in the dean's office lack of funds or personal change of plans" (p. 170).

Another shortcoming of degree attainment as a criterion is the fact that most graduate programs keep very inadequate records about attrition (Clark, Hartnett, & Baird, 1976). The fact that departmental records are usually inadequate in this area is understandable, for the whole question of defining a doctoral-level dropout is not at all simple. As hinted earlier, cases of dropping out at the doctoral level are often less matters of a definite, formal decision on the student's part than of a long-term process of indecision that results in failure to re-enroll and that, after a lapse of several years, is recognized as a de facto withdrawal without any kind of official (or sometimes even informal) communication of intent.

Time to the Degree

The time span between beginning doctoral study and completing the requirements for the degree has been a much-criticized aspect of advanced study in this country. The time-span problem has received considerable attention from researchers, both at the national level (National Academy of Sciences, 1967; Tucker, Gottlieb, & Pease, 1964; Wilson, 1965) and at various doctoral-granting universities such as Columbia (Rosenhaupt, 1958), Michigan (Bretsch, 1965; Heine, 1976), and Harvard (Doermann, 1968).

Time-to-the-degree data have been used occasionally as a criterion in studies of success in graduate school. Willingham (1974), for example, found more than a dozen studies in which time-to-the-degree was used as a criterion in prediction studies with GRE test scores. How long it takes one to earn the degree, like degree attainment, does have a certain rational appeal. The speed with which one accomplishes complex tasks has always commanded respect in academic circles, and it is probably reasonable to surmise that, within a given discipline, those who complete all degree requirements in three years are more able, on the average, than those who take six years. The major drawback of time-to-the-degree as a criterion, however, is that the reasons for taking longer are often ones over which the student has little or no control. It is true that some students do not earn the degree sooner because they cannot, simply because they have difficulty meeting the requirements for the degree. In these cases, time-to-the-degree would appear to be a clear function of intellectual ability, willingness to work, "staying power," or other similar characteristics and is thus a logical criterion. In many other cases, time-to-the-degree is a function of financial stringency (requiring the student to work at something other than completing the dissertation, for example), difficulties with dissertation committees (especially in the form of prolonged absences from the campus), and other nonacademic factors (Katz & Hartnett, 1976). Berelson (1960) even suggests that some students actually are not allowed to finish sooner because "... they are needed as teaching assistants for the department or as research assistants for the professor" (p. 162).

Comprehensive Examinations

The nature and form of comprehensive examinations vary considerably, both across disciplines and across institutions within a discipline. More often than not, however, the term "comprehensive examinations" applies to an examination or set of examinations (usually written, occasionally oral) that follows the student's completion of formal course work at the graduate level and is used to determine the student's mastery of research in the field and eligibility for formal degree candidacy in the department. (In some institutions these are referred to as "qualifying examinations." With some exceptions, the only difference is the name, not the timing or basic purpose of the test. Therefore, the terms "comprehensive examination" and "qualifying examination" will be used interchangeably.)

One of the most commonly criticized weaknesses of comprehensive examinations is the frequent departmental uncertainty and lack of specificity about their purpose and, consequently, their basic form and content. As one critic observed, "... graduate departments in many cases have never defined for themselves, much less for the students, what ground the examination should cover and how to go about preparing for it" (Carmichael, 1961, p. 149). Apparently this observation, made nearly 20 years ago, is still an accurate description of the status of doctoral-level comprehensive examinations. There is some evidence to suggest that some departments do not take the comprehensive examinations very seriously and are apparently not very concerned about taking steps that would make them more reliable and meaningful measures of student attainments (Berelson, 1960; Heiss, 1970; Mayhew & Ford, 1974). In addition to difficulties with the purposes and content of comprehensive examinations, evidently few graduate departments have given serious attention to the question of how to grade such exams, which are almost always in essay or expository form. As a result, it may well be the case (there is no direct evidence for this assertion) that evaluations of student comprehensive examination performance are often not very reliable.

In spite of these shortcomings, many graduate faculty members appear to be reluctant to consider more systematic procedures for the assessment of student academic attainment (Carlson, Evans, & Kuykendall, 1973). Therefore, the suggestion that graduate faculties should specify the competencies they expect of students and construct examinations to test whether those competencies have been achieved is seldom given serious consideration, in spite of the success of such practices in several professional fields (e.g., McGuire & Babbott, 1967; Rimoldi, 1963).
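One modest check of the kind this section finds lacking would be to have two readers score the same set of examinations independently, correlate the results, and estimate (via the Spearman-Brown formula) the gain from averaging readers. The sketch below uses entirely hypothetical scores from two imagined readers; nothing here is drawn from the studies cited above.

```python
import statistics

def interrater_r(r1, r2):
    """Pearson correlation between two readers' scores on the same exams."""
    m1, m2 = statistics.mean(r1), statistics.mean(r2)
    s1, s2 = statistics.stdev(r1), statistics.stdev(r2)
    n = len(r1)
    return sum((a - m1) * (b - m2) for a, b in zip(r1, r2)) / ((n - 1) * s1 * s2)

def spearman_brown(r, k=2):
    """Estimated reliability of the average of k readers, given single-reader r."""
    return k * r / (1 + (k - 1) * r)

# Hypothetical scores (0-10) from two independent readers of ten essay exams.
reader_a = [7, 5, 8, 6, 9, 4, 7, 6, 8, 5]
reader_b = [6, 5, 9, 5, 8, 5, 6, 7, 8, 4]

r = interrater_r(reader_a, reader_b)
print(f"single-reader reliability estimate: {r:.2f}")
print(f"two-reader average, Spearman-Brown: {spearman_brown(r):.2f}")
```

A department that ran such a check routinely would at least know whether its comprehensive examination scores are stable enough to bear the weight placed on them.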

Dissertation Quality

Evaluation of dissertation quality and the decision to award the degree are necessarily somewhat subjective. It is surprising, however, that the dissertation has not received more attention in validity studies and formal evaluations of graduate programs. The dissertation uniformly stands as the primary piece of evidence that a student can conduct sound scholarly and research endeavors, and the evidence is clear that the dissertation is highly valued by both students and faculty (Berelson, 1960; Porter & Wolfle, 1975). As further evidence of its general esteem, many disciplines conduct annual competitions to identify particularly outstanding dissertations. Nevertheless, with the exception of several nonresident doctoral programs (e.g., Medsker & Wattenbarger, 1976; Meeth & Wattenbarger, 1974), scant attention has been given to the dissertation as an indicator of doctoral student performance.

There are several positive aspects of the dissertation as a criterion. First, as already indicated, it is regarded as the central test of the ability to carry out scholarly activities. Furthermore, it has an undeniable "real-life" appeal, for in its properly monitored form it tests the extent to which students can conceive and carry out activities that are expected to occupy a substantial part of their professional lives.

On the other hand, it is clear that numerous practical difficulties would be encountered in any attempt to employ dissertation quality as a criterion of research competence. For example, it would require carefully selected, objective panels of readers in the discipline, each of whom would have to read numerous dissertations and make ratings on standard, carefully constructed dimensions of quality. Any such undertaking would be fairly expensive and time consuming. In addition, it is sometimes difficult to know just what portion of the dissertation represents the work of the doctoral student and what portion is the work of the student's major professor. The final writing of the thesis, to be sure, can almost always be assumed to be the work of the student. But what about the major conceptual orientation or hypothesis of the inquiry, the basic research design or strategy, or the methods chosen to analyze the data? To some extent, these considerations are always influenced by a student's dissertation chairperson and committee members. Berelson (1960), for example, reports that graduate students select their own dissertation topics very rarely: less than 10% of the time, in fact, in the humanities and sciences. The problem is that even within a department, students are influenced unevenly, and therefore the extent to which the dissertation serves as a true measure of the student's research competence is not always clear.

Informal Criteria

One final observation is in order before the review of administrative criteria is concluded. Each of these criteria is "formal," in the sense that the evaluations tend to occur at prescribed dates (e.g., comprehensive examinations) or over a prescribed period of time (e.g., a course grade), and some summary result of the evaluation is then transmitted to the student so that it becomes part of both the student's and the department's official record. It needs to be recognized that a good deal of the evaluation process in graduate education (just how much cannot be said) operates in a more informal fashion, and its results never become part of any formal record. For example, many faculty members form opinions or make judgments about students after contacts with them over a long period of time in a variety of settings. Then, on the basis of these gradually formed opinions, they give less support and encouragement to the less able students in the form of personal communications and contacts, invitations to join in collaborative research efforts, opportunities for teaching and research assistantships, unwillingness to serve on dissertation committees, discouragement during work on the dissertation, and the like (Katz & Hartnett, 1976; Sanford, 1976). The course grades for these students may be acceptable (presumably because some faculty members are reluctant to assign poor grades, since their assessments will not be anonymous), and their performance on the comprehensive examination may have been acceptable (after several attempts); but, by means of these other more subtle mechanisms, such students are gradually "cooled out" of graduate study. To the extent that such informal assessments actually occur and are communicated to students, the administrative criteria reviewed here provide an incomplete picture of the way graduate student performance is evaluated.

Evidence of Professional Accomplishment

A substantial body of research literature has developed in recent years that deals with student accomplishments. Basically, this research indicates that self-reported accomplishments at one educational level (secondary school, for example) tend to predict similar accomplishments at a later educational level (e.g., college). Perhaps the best evidence comes from the National Merit Scholarship Corporation, which reported a series of studies in the 1960s clearly indicating that the best predictor of a specific nonacademic accomplishment in college (e.g., composing or arranging music that was publicly performed, getting elected to one or more student offices) was accomplishment in that same (or a very similar) area in secondary school, as measured by a simple student self-report from a checklist (Holland & Nichols, 1964; Nichols & Holland, 1963). Even more striking was the finding that such specific accomplishments are not accurately predicted by such standard academic indices as grades or verbal aptitude test scores (Baird & Richards, 1968; Richards, Holland, & Lutz, 1967; Wing & Wallach, 1971).

Most of these efforts have concentrated on the prediction of undergraduate performance on the basis of secondary school (or nonschool) accomplishments. Recently, however, Baird (1976) developed an experimental inventory of undergraduate accomplishments that might be used in graduate school admissions. Comparable self-report forms could, of course, be developed for documenting graduate-level accomplishments that might reasonably be expected to occur during the student's graduate career and would be relevant to scholarly or professional performance.

Such indices of professional behavior have considerable merit as criteria because they reflect important long-term objectives. However, if such measures are to be seriously considered as a graduate student performance criterion, routine procedures for collecting the information would seem to be essential. Currently, such student accomplishment information is rarely kept in any systematic way in departmental files. In addition, professional accomplishments would have several significant limitations as a criterion. One is that such accomplishments may be partly a matter of simple luck. Some graduate students publish journal articles as joint authors or coauthor papers presented at professional meetings because they happen to be fortunate enough to be associated with a major professor who is nurturant and supportive in this regard, whereas other students are perhaps equally competent but do not receive the same encouragement or assistance. Similarly, while one student's work on a project may result in coauthorship of a journal article with one professor, an even more substantial contribution from another student may not even earn an "I am indebted" footnote. To the extent that such differences are commonplace, these kinds of student behaviors are misleading as indices of individual student accomplishment.

A second difficulty with the professional accomplishments criterion is that the distribution of such accomplishments will be extremely narrow and skewed. At least this is true at the undergraduate level (Baird, 1978), and it is almost surely the case in graduate school as well. This does not affect the logic of using accomplishments as a criterion, of course, but it does reduce their likely utility.
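The way a narrow, skewed distribution limits utility can be illustrated with a small simulation. The ability scale, base rates, and publication mechanism below are all invented for illustration: even when underlying ability genuinely drives the chance of publishing, a count on which most students sit at zero correlates with ability far less strongly than a fuller-range criterion does.

```python
import math
import random
import statistics

random.seed(3)

def corr(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    n = len(x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * sx * sy)

ability = [random.gauss(0, 1) for _ in range(5000)]

def publications(a):
    # A few chances to publish, each unlikely, so most counts end at zero.
    p = 1 / (1 + math.exp(-(a - 3)))
    return sum(random.random() < p for _ in range(3))

counts = [publications(a) for a in ability]
grades = [a + random.gauss(0, 0.7) for a in ability]  # a fuller-range criterion

zero_share = sum(c == 0 for c in counts) / len(counts)
print(f"students with no publications: {zero_share:.0%}")
print(f"ability vs. publication count:   r = {corr(ability, counts):.2f}")
print(f"ability vs. full-range criterion: r = {corr(ability, grades):.2f}")
```

The logic of the criterion survives (the correlation is still positive), but the heavy mass at zero caps how much differential information the counts can carry, which is the utility problem the text describes.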

Specially Constructed Criteria

In addition to traditional criteria and student professional accomplishments, there is a third category of criterion information that needs to be considered: specially constructed measures of various critical competencies regarded as important outcomes of advanced training, but outcomes or competencies that are rarely assessed in any systematic way by most graduate programs. In this section two types of constructed criteria are considered: rating scales and performance work samples.

Rating Scales

Global faculty ratings of graduate student performance have been used as a criterion measure in a fair number of validity studies, though not nearly so often as grades or degree attainment (Willingham, 1974). It would appear that ratings are an acceptable criterion measure, at least in many fields of graduate education (Carlson, Evans, & Kuykendall, 1973).

One advantage of ratings is that they are relatively easy to obtain, thus providing a fairly convenient criterion. Unfortunately, however, ratings suffer from several serious shortcomings. Perhaps the most troublesome problem with ratings as a criterion of graduate student performance is simply that many members of the faculty will not be sufficiently familiar with the student's work to be able to make an informed rating. This was evident in research conducted in graduate business schools (Hilton, Kendall, & Sprecher, 1970) and would seem likely to be characteristic of other graduate programs as well.

In addition, ratings have often been beset with problems of leniency and range restriction (Reilly, 1974a). And though efforts to improve ratings through critical incident techniques did distinguish a small number of separate factors comprising graduate student performance (e.g., independence and initiative, conscientiousness, critical facility) in chemistry, English, and psychology (Reilly, 1974b), subsequent research revealed that scales developed to obtain ratings of these separate factors were highly intercorrelated and had only minimal reliability (Carlson, Reilly, Mahoney, & Casserly, 1976). The high intercorrelations were confirmed in research on undergraduate students, where it was found that faculty ratings of students are heavily dominated by an academic performance factor, as defined by grades (Davis, 1965).

Perhaps the most effective rating scales are those that define the extremes of the behavior being observed and, if possible, also provide descriptions of intermediate points along the continuum. Such "behaviorally anchored" rating scales hold promise, but the utility of such measures depends heavily on the experience of the raters and the thoroughness with which they have been trained. Even with careful training, however, a "halo" effect (that is, the tendency for an observer's general impression to influence his or her ratings of specific behaviors) and other forms of contamination are frequently difficult to eliminate when rating scales are used (Brogden & Taylor, 1950; Glaser & Klaus, 1962). Davis's (1965) finding that faculty ratings of various traits of undergraduate students are all highly correlated with student grades is again relevant in this regard.

In certain respects, ratings have always been a fairly important aspect of student evaluation in graduate education and are likely to remain so. Grades, for example, are a form of rating in a single academic course (see the earlier discussion of the shortcomings of grades as a performance criterion), and letters of recommendation are another. Letters of recommendation, however, are almost always written by someone chosen by the student and therefore, presumably, by someone very familiar with the student's work and abilities. Some departments apparently employ

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction

requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

Page 8: The Criterion Problem: What Measure of Success in Graduate … · 2016. 5. 18. · 281 The Criterion Problem: What Measure of Success in Graduate Education? Rodney T. Hartnett and

288

global faculty ratings in the process of makingcertain internal decisions (e.g., about studentassistantships or certain field work experiences),but these ratings are rarely done in a very formalway involving the ratings of the entire facultywithin the department. Given the problems ofthe lack of faculty contact with some students,rater unreliability, and halo effect, it seems un-likely that global faculty ratings will ever be-come an important or widely used criterion ofgraduate student performance.One final aspect of ratings deserves to be men-

tioned. For research purposes, peer (fellow-stu-dent) ratings should not be overlooked, for theyhave been found to be promising predictors ofsubsequent performance, both in and outside ofeducation. The usefulness of peer ratings forpredicting success in the military was demon-strated many years ago (e.g., Bryant, 1956;Tupes, 1957a, 1957b); more recently, their poten-tial in educational settings was suggested whenit was found that peer ratings of nonintellectivetraits were superior,to both academic aptitudeand self-report measures in the prediction offirst-year performance in college (Smith, 1967).At the graduate level the research on the

utility of peer ratings has been infrequent butencouraging. Kelly and Fiske (1951) found peerratings of clinical psychology trainees to be onlyslightly less accurate in predicting later successthan ratings by trained psychologists, Eisenberg(1965) found peer ratings to be highly correlatedwith performance on comprehensive examina-tions in one doctoral program, and Wiggins andBlackburn (1969) found peer ratings to be betterpredictors of first-year performance in psycholo-gy at one institution than a host of other moretraditional predictors.

Performance Work Samples

Frederiksen (1977) has argued that before developing a measure of the effectiveness of a training program, the kinds of skills and behaviors to be expected from those who have experienced the training need to be understood first; one way to accomplish this is to analyze the sorts of things graduates of these training programs will be doing in their subsequent careers and occupations. One may understandably hesitate at the suggestion that a clearer understanding of the specific behaviors and activities that will be expected of the graduates of most graduate programs is needed. These people, after all, will be employed in positions requiring very complex behaviors and skills, ones neither easily defined nor simply described. It is one thing to give a precise description of the specific job activities of a lathe operator, quite another to say just what a college professor does. But minutely detailed descriptions are really not necessary. As Cronbach (1970) has argued, what is needed is not a test that will sample the criterion task exactly, "... but the general type of intellectual or motor performance required by the criterion task" (p. 199).

The primary purpose of the great majority of doctoral programs in this country is to prepare scholars and researchers (Clark, Hartnett, & Baird, 1976). Purposes such as the preparation of future teachers or practitioners are also acknowledged, but are not regarded as being nearly so important. In considering ways to develop additional systematic assessment methods, it is quite reasonable, then, to focus on the student's ability to carry out research and to recognize the merits and deficiencies in the research reported by others. One possible way to assess the former is by closer, more objective evaluations of dissertations, as suggested above. An alternative way is by means of a specially constructed measure for each discipline that would directly assess important aspects of student research performance in a standardized task.

Research on the development of measures appropriate for use as criterion measures in advanced training programs is not a new area of inquiry. Previous research has resulted in the development of the Tests of Scientific Thinking (TST), which are free-response job-sample tests simulating tasks that might be encountered by a behavioral scientist (Frederiksen & Ward, 1978; Ward & Frederiksen, 1977). Research with the experimental TST has indicated that the various TST subtests are not highly correlated with GRE scores and are more highly correlated than the GRE tests with student self-reported professional accomplishments. These data suggest that relevant criterion measures for graduate student performance can be developed that are not simply extensions of traditional verbal skills measures.

As with the other performance criteria, performance work samples have a number of limitations. For one thing, it would be difficult, if not impossible, to design a work-sample measure that would be appropriate to all Ph.D. candidates in a program or even in a branch of a discipline. Because of the apprenticeship nature of the graduate experience, in certain disciplines there is often little substance in common among the various subspecialties. Another problem is that such a standardized criterion measure might have the unfortunate effect of pressuring departments toward greater uniformity in their curricula. Given current student assessment procedures, within-department diversity is permitted to thrive, with less popular subspecialties often going their own way, eschewing pressures to adopt "the latest methodologies." In effect, this is an expression of concern about the extent to which a standardized criterion measure would gradually become the definer, or an undue influencer, of the nature of graduate school curricula.

Conclusions

This paper has attempted a general analysis of the strengths and weaknesses of a fairly large number of criteria that have been or might be used to evaluate graduate (especially doctoral) student performance. Though the intention was to examine these criteria primarily in terms of their adequacy as dependent variables for the validation of graduate school selection procedures, it is apparent that each separate criterion can be fully understood only after consideration of the more general process of graduate student evaluation. As a result, though this review deals primarily with the criterion problem in the traditional measurement sense, it also, to some extent, provides an overview of student performance evaluation in graduate education.

Perhaps the most important observation is that, overall, very little research literature is available about how graduate student academic performance is assessed. Several general analyses of graduate education have dealt briefly with the topic (often in quite critical terms), but very little serious, thoughtful examination has been made of what does (and/or should) constitute successful student performance. To some extent this lack of empirical attention can be explained by a concern, among graduate faculty members, about too much emphasis on specifying program outcomes. Some argue persuasively that the major strength of advanced study is its flexibility and openness to intellectual idiosyncrasy. This view no doubt explains, at least in part, why student performance evaluation practices in graduate education are characterized by an almost bewildering diversity, with even such assessment staples as course grades and comprehensive examinations varying from one program to another in purpose, form, timing, and use.

These general observations, along with the more detailed shortcomings of specific criteria discussed earlier in the paper, suggest that the criterion problem can be expected to continue to be bothersome to those conducting research on graduate student performance. Attention to certain psychometric characteristics of standard criteria (such as improving the inter-reader agreement on comprehensive examinations) or the willingness to consider the possible merits of new criteria (e.g., performance work samples, global ratings) may yield modest gains. But the nature of the measurement problems is so pronounced, and the logistic and philosophic realities so chronic, that for the foreseeable future measurement specialists will have to be content with less-than-satisfactory criterion measures when embarking on research on graduate student performance.


References

Baird, L. L. Development of an inventory of documented accomplishments: Report of phase I and proposal for phase II (GRE No. 77-3). Princeton, NJ: Educational Testing Service, December 1976.
Baird, L. L. Final report on phase II of the project to develop an inventory of documented accomplishments (Unpublished draft for the Graduate Record Examinations Board). Princeton, NJ: Educational Testing Service, August 1978.
Baird, L. L., & Richards, J. M., Jr. The effects of selecting college students by various kinds of high school achievements (ACT Research Report No. 23). Iowa City, IA: American College Testing Program, 1968.
Barritt, L. S. The consistency of first semester college grade-point average. Journal of Educational Measurement, 1966, 3, 261-262.
Berelson, B. Graduate education in the United States. New York: McGraw-Hill, 1960.
Boldt, R. F. Factor analysis of business school grades (Research Bulletin RB-70-49). Princeton, NJ: Educational Testing Service, 1970.
Bowers, J. E. A test of variation in grading standards. Educational and Psychological Measurement, 1967, 27, 429-430.
Bretsch, H. A study of doctoral recipients, 1938-1958 (Graduate Study No. 6). Ann Arbor: University of Michigan, 1965. (mimeo)
Brogden, H. E., & Taylor, E. K. The theory and classification of criterion bias. Educational and Psychological Measurement, 1950, 10, 158-186.
Bryant, N. D. A factor analysis of the report of officer effectiveness. Lackland Air Force Base, TX: Air Force Personnel and Training Research Center, 1956.
Caldwell, E., & Hartnett, R. T. Sex bias in college grading. Journal of Educational Measurement, 1967, 4, 129-133.
Carlson, A. B., Evans, F. R., & Kuykendall, N. J. The feasibility of common criterion validity studies of the GRE (Research Memorandum 73-16). Princeton, NJ: Educational Testing Service, 1973.
Carlson, A. B., Reilly, R. R., Mahoney, M. H., & Casserly, P. L. The development and pilot testing of criterion rating scales. Princeton, NJ: Educational Testing Service, 1976.
Carmichael, O. C. Graduate education: A critique and a program. New York: Harper, 1961.
Clark, E. L. Reliability of grade-point averages. The Journal of Educational Research, 1964, 57, 428-430.
Clark, M. J., Hartnett, R. T., & Baird, L. L. Assessing dimensions of quality in doctoral education: A technical report of a national study in three fields. Princeton, NJ: Educational Testing Service, 1976.
Cronbach, L. J. Essentials of psychological testing (3rd ed.). New York: Harper & Row, 1970.
Davis, J. A. What college teachers value in students. College Board Review, 1965, 56, 15-18.
Doermann, H. Baccalaureate origins and performance of students in the Harvard Graduate School of Arts and Sciences. Unpublished report, 1968.
Eisenberg, T. Are doctoral comprehensive examinations necessary? American Psychologist, 1965, 20, 168-169.
Frederiksen, N. There ought to be a law. Address presented at the ETS Invitational Conference on Testing Problems, October 1977.
Frederiksen, N., & Ward, W. C. Measures for the study of creativity in scientific problem-solving. Applied Psychological Measurement, 1978, 2, 1-24.
French, J. W. The description of aptitude and achievement tests in terms of rotated factors. Psychometric Monographs, 1951 (No. 5).
Glaser, R., & Klaus, D. J. Proficiency measurement: Assessing human performance. In R. M. Gagné (Ed.), Psychological principles in system development. New York: Holt, Rinehart, & Winston, 1962.
Goldman, R. D., & Slaughter, R. E. Why college grade-point average is difficult to predict. Journal of Educational Psychology, 1976, 68, 9-14.
Hackman, J. R., Wiggins, N., & Bass, A. R. Prediction of long term success in doctoral work in psychology. Educational and Psychological Measurement, 1970, 20, 365-374.
Halleck, S. L. Emotional problems of the graduate student. In J. Katz & R. T. Hartnett (Eds.), Scholars in the making. Cambridge, MA: Ballinger, 1976.
Heine, R. W. Comparative performance of doctoral students admitted on the basis of traditional and non-traditional criteria. Unpublished manuscript, University of Michigan, 1976.
Heiss, A. Challenges to graduate schools. San Francisco: Jossey-Bass, 1970.
Hilton, T. L., Kendall, L. M., & Sprecher, T. B. Performance criteria in graduate business study. Parts I and II: Development of rating scales, background data, and pilot study (Research Bulletin 70-3). Princeton, NJ: Educational Testing Service, 1970.
Hirschberg, N., & Itkin, S. Graduate student success in psychology. American Psychologist, 1978, 33, 1083-1093.
Holland, J. L., & Nichols, R. C. Prediction of academic and extracurricular achievement in college. Journal of Educational Psychology, 1964, 55, 55-65.
Humphreys, L. G. The fleeting nature of the prediction of college academic success. Journal of Educational Psychology, 1968, 59, 375-380.
Juola, A. E. Freshman level ability tests versus cumulative grades in the prediction of successive terms performance in college. Paper presented at the annual meeting of the American Educational Research Association, Chicago, February 1964.
Juola, A. E. Illustrative problems in college level grading. Personnel and Guidance Journal, 1968, 47, 29-33.
Katz, J., & Hartnett, R. Scholars in the making. Cambridge, MA: Ballinger, 1976.
Kelly, E. L., & Fiske, D. W. The prediction of performance in clinical psychology. Ann Arbor: University of Michigan Press, 1951.
Mayhew, L. R., & Ford, P. J. Reform in graduate and professional education. San Francisco: Jossey-Bass, 1974.
McClelland, D. C. Testing for competence rather than for "intelligence." American Psychologist, 1973, 28, 1-14.
McGuire, C. H., & Babbott, D. Simulation technique in the measurement of problem-solving skills. Journal of Educational Measurement, 1967, 4, 1-11.
Medsker, L. L., & Wattenbarger, J. L. An analysis of dissertations, 1975. Mimeographed paper, Walden University, 1976.
Meeth, L. R., & Wattenbarger, J. L. Dissertation quality at Walden University. Mimeographed paper, Walden University, 1974.
National Academy of Sciences. Doctorate recipients from United States universities, 1958-1966. Washington, DC, 1967.
Nichols, R. C., & Holland, J. L. Prediction of the first-year college performance of high aptitude students. Psychological Monographs, 1963, 77 (7, Whole No. 570).
Porter, A. L., & Wolfle, D. Utility of the doctoral dissertation. American Psychologist, 1975, 30, 1054-1061.
Reilly, R. R. Critical incidents of graduate student performance. Princeton, NJ: Educational Testing Service, 1974. (a)
Reilly, R. R. Factors in graduate student performance (Research Bulletin 74-2). Princeton, NJ: Educational Testing Service, 1974. (b)
Richards, J. M., Jr., Holland, J. L., & Lutz, S. W. Prediction of student accomplishment in college. Journal of Educational Psychology, 1967, 58, 343-355.
Rimoldi, H. J. Rationale and application of the test of diagnostic skills. Journal of Medical Education, 1963, 38, 364-373.
Rosenhaupt, H. Graduate students' experience at Columbia University, 1940-1956. New York: Columbia University Press, 1958.
Sanford, M. Making it in graduate school. Berkeley: Montaigne, 1976.
Singer, J. E. The use of manipulative strategies: Machiavellianism and attractiveness. Sociometry, 1964, 27, 128-150.
Smith, G. M. Usefulness of peer ratings of personality in educational research. Educational and Psychological Measurement, 1967, 27, 967-984.
Tucker, A., Gottlieb, D., & Pease, J. Factors related to attrition among doctoral students (Cooperative Research Project No. 1146). Washington, DC: U.S. Office of Education, 1964.
Tupes, E. C. Personality traits related to effectiveness of junior and senior air force officers. Lackland Air Force Base, TX: Air Force Personnel and Training Research Center, 1957. (a)
Tupes, E. C. Relationships between behavior trait ratings by peers and later officer performance of USAF officer candidate school graduates. Lackland Air Force Base, TX: Air Force Personnel and Training Research Center, 1957. (b)
Ward, W. C., & Frederiksen, N. A study of the predictive validity of the tests of scientific thinking (Research Bulletin 77-6). Princeton, NJ: Educational Testing Service, 1977.
Wiggins, N., & Blackburn, M. Prediction of first-year graduate success in psychology: Peer ratings. Journal of Educational Research, 1969, 63, 81-85.
Willingham, W. W. Predicting success in graduate education. Science, 1974, 183, 273-278.
Wilson, K. M. Of time and the doctorate. Atlanta, GA: Southern Regional Education Board, 1965.
Wilson, K. M. Internal progress report of the Graduate Record Examinations Board Cooperative Validity Studies Project. Princeton, NJ: Educational Testing Service, 1978.
Wing, C. W., & Wallach, M. A. College admissions and the psychology of talent. New York: Holt, Rinehart, & Winston, 1971.

Acknowledgment

This research was supported by a grant from the Graduate Record Examinations Board.

Author’s Address

Send requests for reprints or further information to Rodney T. Hartnett, Senior Research Psychologist, Higher Education Research Group, Educational Testing Service, Princeton, NJ 08541.
