

VOL. 66, No. 6 DECEMBER 1966

Psychological Bulletin

THE TEST OF SIGNIFICANCE IN PSYCHOLOGICAL RESEARCH

DAVID BAKAN

University of Chicago

The test of significance does not provide the information concerning psychological phenomena characteristically attributed to it; and a great deal of mischief has been associated with its use. The basic logic associated with the test of significance is reviewed. The null hypothesis is characteristically false under any circumstances. Publication practices foster the reporting of small effects in populations. Psychologists have "adjusted" by misinterpretation, taking the p value as a "measure," assuming that the test of significance provides automaticity of inference, and confusing the aggregate with the general. The difficulties are illuminated by bringing to bear the contributions from the decision-theory school on the Fisher approach. The Bayesian approach is suggested.

That which we might identify as the "crisis of psychology" is closely related to what Hogben (1958) has called the "crisis in statistical theory." The vast majority of investigations which pass for research in the field of psychology today entail the use of statistical tests of significance. Most characteristically, when a psychologist finds a problem he wishes to investigate he converts his intuitions and hypotheses into procedures which will yield a test of significance; and will characteristically allow the result of the test of significance to bear the essential responsibility for the conclusions which he will draw.

The major point of this paper is that the test of significance does not provide the information concerning psychological phenomena characteristically attributed to it; and that, furthermore, a great deal of mischief has been associated with its use. What will be said in this paper is hardly original. It is, in a certain sense, what "everybody knows." To say it "out loud" is, as it were, to assume the role of the child who pointed out that the emperor was really outfitted only in his underwear. Little of that which is contained in this paper is not already available in the literature, and the literature will be cited.

Lest what is being said in this paper be misunderstood, some clarification needs to be made at the outset. It is not a blanket criticism of statistics, mathematics, or, for that matter, even the test of significance when it can be appropriately used. The argument is rather that the test of significance has been carrying too much of the burden of scientific inference. Wise and ingenious investigators can find their way to reasonable conclusions from data because and in spite of their procedures. Too often, however, even wise and ingenious investigators, for varieties of reasons not the least of which are the editorial policies of our major psychological journals, which we will discuss below, tend to credit the test of significance with properties it does not have.

LOGIC OF THE TEST OF SIGNIFICANCE

The test of significance has as its aim obtaining information concerning a characteristic of a population which is itself not directly observable, whether for practical or more intrinsic reasons. What is observable is the sample. The work assigned to the test of significance is that of aiding in making inferences from the observed sample to the unobserved population.

The critical assumption involved in testing significance is that, if the experiment is conducted properly, the characteristics of the population have a designably determinative influence on samples drawn from it, that, for example, the mean of a population has a determinative influence on the mean of a sample drawn from it. Thus if P, the population characteristic, has a determinative influence on S, the sample characteristic, then there is some license for making inferences from S to P.

If the determinative influence of P on S could be put in the form of simple logical implication, that P implies S, the problem would be quite simple. For, then we would have the simple situation: if P implies S, and if S is false, P is false. There are some limited instances in which this logic applies directly in sampling. For example, if the range of values in the population is between 3 and 9 (P), then the range of values in any sample must be between 3 and 9 (S). Should we find a value in a sample of, say, 10, it would mean that S is false; and we could assert that P is false.
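The falsification logic of this limited case can be sketched directly, using the article's own range example; the sample values below are invented for illustration:

```python
# If P ("every value in the population lies between 3 and 9") is true,
# then S ("every value in any sample lies between 3 and 9") must be true.
# Observing a sample value of 10 denies S, and hence denies P.

def consistent_with_p(sample, low=3, high=9):
    """S: all sample values fall inside the range claimed by P."""
    return all(low <= x <= high for x in sample)

print(consistent_with_p([4, 7, 10, 5]))  # False -> S denied, so P is false

# Note the asymmetry the next paragraph draws out: a sample entirely
# inside [3, 9] does NOT establish P; it merely fails to deny it.
print(consistent_with_p([4, 7, 5]))      # True -> P merely not denied
```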

It is clear from this, however, that, strictly speaking, one can only go from the denial of S to the denial of P; and not from the assertion of S to the assertion of P. It is within this context of simple logical implication that the Fisher school of statisticians have made important contributions—and it is extremely important to recognize this as the context.

In contrast, approaches based on the theorem of Bayes (Bakan, 1953, 1956; Edwards, Lindman, & Savage, 1963; Keynes, 1948; Savage, 1954; Schlaifer, 1959) would allow inferences to P from S even when S is not denied, as S adding something to the credibility of P when S is found to be the case. One of the most viable alternatives to the use of the test of significance involves the theorem of Bayes; and the paper by Edwards et al. (1963) is particularly directed to the attention of psychologists for use in psychological research.
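The contrast can be illustrated numerically: under Bayes' theorem, observing S (rather than denying it) raises the credibility of P. The prior and likelihoods below are invented for illustration, not taken from any of the cited sources:

```python
# Bayes' theorem: observing S can add to the credibility of P even though,
# under simple logical implication, asserting S asserts nothing about P.
# All numbers here are hypothetical.
prior_p = 0.30            # prior credibility of hypothesis P
lik_s_given_p = 0.90      # probability of observing S if P is true
lik_s_given_not_p = 0.40  # probability of observing S if P is false

evidence = lik_s_given_p * prior_p + lik_s_given_not_p * (1 - prior_p)
posterior_p = lik_s_given_p * prior_p / evidence

# S was observed and not denied; P nonetheless became more credible.
print(round(posterior_p, 3))  # 0.491, up from the prior of 0.30
```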

The notion of the null hypothesis¹ promoted by Fisher constituted an advance within this context of simple logical implication. It allowed experimenters to set up a null hypothesis complementary to the hypothesis that the investigator was interested in, and provided him with a way of positively confirming his hypothesis. Thus, for example, the investigator might have the hypothesis that, say, normals differ from schizophrenics. He would then set up the null hypothesis that the means in the population of all normals and all schizophrenics were equal. Thus, the rejection of the null hypothesis constituted a way of asserting that the means of the populations of normals and schizophrenics were different, a completely reasonable device whereby to affirm a logical antecedent.

¹ There is some confusion in the literature concerning the meaning of the term null hypothesis. Fisher used the term to designate any exact hypothesis that we might be interested in disproving, and "null" was used in the sense of that which is to be nullified (cf., e.g., Berkson, 1942). It has, however, also been used to indicate a parameter of zero (cf.,

The model of simple logical implication for making inferences from S to P has another difficulty which the Fisher approach sought to overcome. This is that it is rarely meaningful to set up any simple "P implies S" model for parameters that we are interested in. In the case of the mean, for example, it is rather that P has a determinative influence on the frequency of any specific S. But one experiment does not provide many values of S to allow the study of their frequencies. It gives us only one value of S. The sampling distribution is conceived which specifies the relative frequencies of all possible values of S. Then, with the help of an adopted level of significance, we could, in effect, say that S was false; that is, any S which fell in a region whose relative theoretical frequency under the null hypothesis was, say, 5% would be considered false. If such an S actually occurred, we would be in a position to declare P to be false, still within the model of simple logical implication.

It is important to recognize that one of the essential features of the Fisher approach is what may be called the once-ness of the experiment; the inference model takes as critical that the experiment has been conducted once. If an S which has a low probability under the null hypothesis actually occurs, it is taken that the null hypothesis is false. As Fisher (1947, p. 14) put it, why should the theoretically rare event under the null hypothesis actually occur to "us"? If it does occur, we take it that the null hypothesis is false. Basic is the idea that "the theoretically unusual does not happen to me."²

e.g., Lindquist, 1940, p. 15), that the difference between the population means is zero, or the correlation coefficient in the population is zero, the difference in proportions in the population is zero, etc. Since both meanings are usually intended in psychological research, it causes little difficulty.

It should be noted that the referent for all probability considerations is neither in the population itself nor the subjective confidence of the investigator. It is rather in a hypothetical population of experiments all conducted in the same manner, but only one of which is actually conducted. Thus, of course, the probability of falsely rejecting the null hypothesis if it were true is exactly that value which has been taken as the level of significance. Replication of the experiment vitiates the validity of the inference model, unless the replication itself is taken into account in the model and the probabilities of the model modified accordingly (as is done in various designs which entail replication, where, however, the total experiment, including the replications, is again considered as one experiment). According to Fisher (1947), "it is an essential characteristic of experimentation that it is carried out with limited resources [p. 18]." In the Fisher approach, the "limited resources" is not only a making of the best out of a limited situation, but is rather an integral feature of the inference model itself. Lest he be done a complete injustice, it should be pointed out that he did say, "In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us statistically significant results [1947, p. 14]." However, although Fisher "himself" believes this, it is not built into the inference model.³

² I playfully once conducted the following "experiment": Suppose, I said, that every coin has associated with it a "spirit"; and suppose, furthermore, that if the spirit is implored properly, the coin will veer head or tail as one requests of the spirit. I thus invoked the spirit to make the coin fall head. I threw it once, it came up head. I did it again, it came up head again. I did this six times, and got six heads. Under the null hypothesis the probability of occurrence of six heads is (1/2)⁶ = .016, significant at the 2% level of significance. I have never repeated the experiment. But, then, the logic of the inference model does not really demand that I do! It may be objected that the coin, or my tossing, or even my observation was biased. But I submit that such things were in all likelihood not as involved in the result as corresponding things in most psychological research.
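Both the arithmetic of footnote 2 and the "hypothetical population of experiments" that gives the level of significance its meaning can be checked by simulation; the trial count and seed below are arbitrary choices for the sketch:

```python
import random

# Footnote 2: the probability of six heads in six tosses of a fair coin
# under the null hypothesis.
p_six_heads = 0.5 ** 6
print(p_six_heads)  # 0.015625, i.e. the .016 reported in the footnote

# The referent of that probability is a hypothetical population of
# experiments, all conducted in the same manner but only one of which is
# actually conducted.  Simulating that population of experiments:
random.seed(1)
trials = 100_000
rejections = sum(
    all(random.random() < 0.5 for _ in range(6))  # six heads in a row
    for _ in range(trials)
)
print(rejections / trials)  # hovers near 0.0156 over many repetitions
```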

DIFFICULTIES OF THE NULL HYPOTHESIS

As already indicated, research workers in the field of psychology place a heavy burden on the test of significance. Let us consider some of the difficulties associated with the null hypothesis.

1. The a priori reasons for believing that the null hypothesis is generally false anyway. One of the common experiences of research workers is the very high frequency with which significant results are obtained with large samples. Some years ago, the author had occasion to run a number of tests of significance on a battery of tests collected on about 60,000 subjects from all over the United States. Every test came out significant. Dividing the cards by such arbitrary criteria as east versus west of the Mississippi River, Maine versus the rest of the country, North versus South, etc., all produced significant differences in means. In some instances, the differences in the sample means were quite small, but nonetheless, the p values were all very low. Nunnally (1960) has reported a similar experience involving correlation coefficients on 700 subjects. Joseph Berkson (1938) made the observation almost 30 years ago in connection with chi-square:

I believe that an observant statistician who has had any considerable experience with applying the chi-square test repeatedly will agree with my statement that, as a matter of observation, when the numbers in the data are quite large, the P's tend to come out small. Having observed this, and on reflection, I make the following dogmatic statement, referring for illustration to the normal curve: "If

³ Possibly not even this criterion is sound. It may be that a number of statistically significant results which are borderline "speak for the null hypothesis rather than against it [Edwards et al., 1963, p. 235]." If the null hypothesis were really false, then with an increase in the number of instances in which it can be rejected, there should be some substantial proportion of more dramatic rejections rather than borderline rejections.


the normal curve is fitted to a body of data representing any real observations whatever of quantities in the physical world, then if the number of observations is extremely large—for instance, on an order of 200,000—the chi-square P will be small beyond any usual limit of significance."

This dogmatic statement is made on the basis of an extrapolation of the observation referred to and can also be defended as a prediction from a priori considerations. For we may assume that it is practically certain that any series of real observations does not actually follow a normal curve with absolute exactitude in all respects, and no matter how small the discrepancy between the normal curve and the true curve of observations, the chi-square P will be small if the sample has a sufficiently large number of observations in it.

If this be so, then we have something here that is apt to trouble the conscience of a reflective statistician using the chi-square test. For I suppose it would be agreed by statisticians that a large sample is always better than a small sample. If, then, we know in advance the P that will result from an application of a chi-square test to a large sample, there would seem to be no use in doing it on a smaller one. But since the result of the former test is known, it is no test at all [pp. 526-527].

As one group of authors has put it, "in typical applications . . . the null hypothesis . . . is known by all concerned to be false from the outset [Edwards et al., 1963, p. 214]." The fact of the matter is that there is really no good reason to expect the null hypothesis to be true in any population. Why should the mean, say, of all scores east of the Mississippi be identical to all scores west of the Mississippi? Why should any correlation coefficient be exactly .00 in the population? Why should we expect the ratio of males to females to be exactly 50:50 in any population? Or why should different drugs have exactly the same effect on any population parameter (Smith, 1960)? A glance at any set of statistics on total populations will quickly confirm the rarity of the null hypothesis in nature.

The reason why the null hypothesis is characteristically rejected with large samples was made patent by the theoretical work of Neyman and Pearson (1933). The probability of rejecting the null hypothesis is a function of five factors: whether the test is one- or two-tailed, the level of significance, the standard deviation, the amount of deviation from the null hypothesis, and the number of observations. The choice of a one- or two-tailed test is the investigator's; the level of significance is also based on the choice of the investigator; the standard deviation is a given of the situation, and is characteristically reasonably well estimated; the deviation from the null hypothesis is what is unknown; and the choice of the number of cases in psychological work is characteristically arbitrary or expeditious. Should there be any deviation from the null hypothesis in the population, no matter how small—and we have little doubt but that such a deviation usually exists—a sufficiently large number of observations will lead to the rejection of the null hypothesis. As Nunnally (1960) put it,

if the null hypothesis is not rejected, it is usually because the N is too small. If enough data are gathered, the hypothesis will generally be rejected. If rejection of the null hypothesis were the real intention in psychological experiments, there usually would be no need to gather data [p. 643].
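The Neyman-Pearson point can be made concrete: holding the test (two-tailed, α = .05), the standard deviation, and a tiny true deviation from null fixed, the probability of rejecting grows with N alone. This is a normal-theory sketch, and the effect of 0.02 standard deviations is invented:

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power_two_sided(effect_sd, n, alpha_z=1.96):
    """Approximate power of a two-tailed z test of a mean, where
    effect_sd is the true deviation from null in standard-deviation units."""
    shift = effect_sd * math.sqrt(n)
    return phi(-alpha_z + shift) + phi(-alpha_z - shift)

# The deviation from null stays fixed at 0.02 sd; only N changes.
for n in (100, 10_000, 100_000, 1_000_000):
    print(n, round(power_two_sided(0.02, n), 3))
# Power climbs toward 1.0 as N grows, however small the true deviation.
```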

2. Type I error and publication practices. The Type I error is the error of rejecting the null hypothesis when it is indeed true, and its probability is the level of significance. Later in this paper we will discuss the distinction between sharp and loose null hypotheses. The sharp null hypothesis, which we have been discussing, is an exact value for the null hypothesis as, for example, the difference between population means being precisely zero. A loose null hypothesis is one in which it is conceived of as being around null. Sharp null hypotheses, as we have indicated, rarely exist in nature. Assuming that loose null hypotheses are not rare, and that their testing may make sense under some circumstances, let us consider the role of the publication practices of our journals in their connection.

It is the practice of editors of our psychological journals, receiving many more papers than they can possibly publish, to use the magnitude of the p values reported as one criterion for acceptance or rejection of a study. For example, consider the following statement made by Arthur W. Melton (1962) on completing 12 years as editor of the Journal of Experimental Psychology, certainly one of the most prestigious and scientifically meticulous psychological journals. In enumerating the criteria by which articles were evaluated, he said:


The next step in the assessment of an article involved a judgment with respect to the confidence to be placed in the findings—confidence that the results of the experiment would be repeatable under the conditions described. In editing the Journal there has been a strong reluctance to accept and publish results related to the principal concern of the research when those results were significant at the .05 level, whether by one- or two-tailed test. This has not implied a slavish worship of the .01 level, as some critics may have implied. Rather, it reflects a belief that it is the responsibility of the investigator in a science to reveal his effect in such a way that no reasonable man would be in a position to discredit the results by saying that they were the product of the way the ball bounces [pp. 553-554].

His clearly expressed opinion that nonsignificant results should not take up the space of the journals is shared by most editors of psychological journals. It is important to point out that I am not advocating a change in policy in this connection. In the total research enterprise where so much of the load for making inferences concerning the nature of phenomena is carried by the test of significance, the editors can do little else. The point is rather that the situation in regard to publication makes manifest the difficulties in connection with the overemphasis on the test of significance as a principal basis for making inferences.

McNemar (1960) has rightly pointed out that not only do journal editors reject papers in which the results are not significant, but that papers in which significance has not been obtained are not submitted, that investigators select out their significant findings for inclusion in their reports, and that theory-oriented research workers tend to discard data which do not work to confirm their theories. The result of all of this is that "published results are more likely to involve false rejection of null hypotheses than indicated by the stated levels of significance [p. 300]," that is, published results which are significant may well have Type I errors in them far in excess of, say, the 5% which we may allow ourselves.
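McNemar's warning follows from simple screening arithmetic. The share of tested (loose) nulls assumed true (90%) and the average power (.3, of the order Cohen found) are invented for illustration:

```python
# Screening arithmetic: if only "significant" results are published, the
# published literature can carry far more than 5% false rejections.
# All rates here are hypothetical.
studies = 1_000
share_null_true = 0.90   # most tested loose nulls assumed true here
alpha = 0.05             # Type I error rate per test
power = 0.30             # meager power, of the order Cohen (1962) reported

false_rejections = studies * share_null_true * alpha       # from true nulls
true_rejections = studies * (1 - share_null_true) * power  # from false nulls

published = false_rejections + true_rejections  # the "significant" results
print(false_rejections / published)  # ~0.6: well beyond the nominal 5%
```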

The suspicion that the Type I error may well be plaguing our literature is given confirmation in an analysis of articles published in the Journal of Abnormal and Social Psychology for one complete year (Cohen, 1962). Analyzing 70 studies in which significant results were obtained with respect to the power of the statistical tests used, Cohen found that power, the probability of rejecting the null hypothesis when the null hypothesis was false, was characteristically meager. Theoretically, with such tests, one should not often expect significant results even when the null hypothesis was false. Yet, there they were! Even if deviations from null existed in the relevant populations, the investigations were characteristically not powerful enough to have detected them. This strongly suggests that there is something additional associated with these rejections of the null hypotheses in question. It strongly points to the possibility that the manner in which studies get published is associated with the findings; that the very publication practices themselves are part and parcel of the probabilistic processes on which we base our conclusions concerning the nature of psychological phenomena. Our total research enterprise is, at least in part, a kind of scientific roulette, in which the "lucky," or constant player, "wins," that is, gets his paper or papers published. And certainly, going from 5% to 1% does not eliminate the possibility that it is "the way the ball bounces," to use Melton's phrase. It changes the odds in this roulette, but it does not make it less a game of roulette.

The damage to the scientific enterprise is compounded by the fact that the publication of "significant" results tends to stop further investigation. If the publication of papers containing Type I errors tended to foster further investigation so that the psychological phenomena with which we are concerned would be further probed by others, it would not be too bad. But it does not. Quite the contrary. As Lindquist (1940, p. 17) has correctly pointed out, the danger to science of the Type I error is much more serious than the Type II error—for when a Type I error is committed, it has the effect of stopping investigation. A highly significant result appears definitive, as Melton's comments indicate. In the 12 years that he edited the Journal of Experimental Psychology, he sought to select papers which were worthy of being placed in the "archives," as he put it. Even the strict repetition of an experiment and not getting significance in the same way does not speak against the result already reported in the literature. For failing to get significance, speaking strictly within the inference model, only means that that experiment is inconclusive; whereas the study already reported in the literature, with a low p value, is regarded as conclusive. Thus we tend to place in the archives studies with a relatively high number of Type I errors, or, at any rate, studies which reflect small deviations from null in the respective populations; and we act in such a fashion as to reduce the likelihood of their correction.

PSYCHOLOGIST'S "ADJUSTMENT" BY MISINTERPRETATION

The psychological literature is filled with misinterpretations of the nature of the test of significance. One may be tempted to attribute this to such things as lack of proper education, the simple fact that humans may err, and the prevailing tendency to take a cookbook approach in which the mathematical and philosophical framework out of which the tests of significance emerge are ignored; that, in other words, these misinterpretations are somehow the result of simple intellectual inadequacy on the part of psychologists. However, such an explanation is hardly tenable. Graduate schools are adamant with respect to statistical education. Any number of psychologists have taken out substantial amounts of time to equip themselves mathematically and philosophically. Psychologists as a group do a great deal of mutual criticism. Editorial reviews prior to publication are carried out with eminent conscientiousness. There is even a substantial literature devoted to various kinds of "misuse" of statistical procedures, to which not a little attention has been paid.

It is rather that the test of significance is profoundly interwoven with other strands of the psychological research enterprise in such a way that it constitutes a critical part of the total cultural-scientific tapestry. To pull out the strand of the test of significance would seem to make the whole tapestry fall apart. In the face of the intrinsic difficulties that the test of significance provides, we rather attempt to make an "adjustment" by attributing to the test of significance characteristics which it does not have, and overlook characteristics that it does have. The difficulty is that the test of significance can, especially when not considered too carefully, do some work; for, after all, the results of the test of significance are related to the phenomena in which we are interested. One may well ask whether we do not have here, perhaps, an instance of the phenomenon that learning under partial reinforcement is very highly resistant to extinction. Some of these misinterpretations are as follows:

1. Taking the p value as a "measure" of significance. A common misinterpretation of the test of significance is to regard it as a "measure" of significance. It is interpreted as the answer to the question "How significant is it?" A p value of .05 is thought of as less significant than a p value of .01, and so on. The characteristic practice on the part of psychologists is to compute, say, a t, and then "look up" the significance in the table, taking the p value as a function of t, and thereby a "measure" of significance. Indeed, since the p value is inversely related to the magnitude of, say, the difference between means in the sample, it can function as a kind of "standard score" measure for a variety of different experiments. Mathematically, the t is actually very similar to a "standard score," entailing a deviation in the numerator, and a function of the variation in the denominator; and the p value is a "function" of t. If this use were explicit, it would perhaps not be too bad. But it must be remembered that this is using the p value as a statistic descriptive of the sample alone, and does not automatically give an inference to the population. There is even the practice of using tests of significance in studies of total populations, in which the observations cannot by any stretch of the imagination be thought of as having been randomly selected from any designable population.⁴ Using the p value in this way, in which the statistical inference model is even hinted at, is completely indefensible; for the single function of the statistical inference model is making inferences to populations from samples.
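The "standard score" character of t, and what is lost in treating the looked-up p as a measure, can be seen from the form t = (difference in means) / (standard error). The summary numbers below are invented, and the normal approximation stands in for the t table for simplicity:

```python
import math

def approx_p_from_t(mean_diff, sd, n):
    """t as a standard-score-like quantity: deviation over a function of the
    variation; p looked up as a two-sided tail area.  Normal approximation
    to the t distribution, adequate for large n."""
    t = mean_diff / (sd / math.sqrt(n))
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

# The identical descriptive difference in the sample (0.5 raw units,
# sd = 5.0) yields wildly different "measures" as n changes:
print(approx_p_from_t(0.5, 5.0, 25))     # t = 0.5, p well above .05
print(approx_p_from_t(0.5, 5.0, 2_500))  # t = 5.0, p far below .001
```

The p value thus reflects the sample size as much as the sample difference, which is why it cannot serve as a comparable "measure" across experiments.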

The practice of "looking up" the p value4 It was decided not to cite any specific studies

to exemplify points such as this one. The reader willundoubtedly be able to supply them for himself.

Page 7: DECEMBER 1966 Psychological Bulletinstats.org.uk/statistical-inference/Bakan1966.pdfVOL. 66, No. 6 DECEMBER 1966 Psychological Bulletin THE TEST OF SIGNIFICANCE IN PSYCHOLOGICAL RESEARCH

TEST OF SIGNIFICANCE IN PSYCHOLOGICAL RESEARCH 429

for the t, which has even been advocated insome of our statistical handbooks (e.g., Lacey,1953, p. 117; Underwood, Duncan, Taylor,& Cotton, 1954, p. 129), rather than lookingup the t for a given p value, violates the in-ference model. The inference model is basedon the presumption that one initially adoptsa level of significance as the specification ofthat probability which is too slow to occur to"us," as Fisher has put it, in this one instance,and under the null hypothesis. A purist mightspeak of the "delicate problem . . . of fudgingwith a posteriori alpha values [levels of sig-nificance. Kaiser, 1960, p. 165]," as thoughthe levels of significance were initially decidedupon, but rarely do psychological researchworkers or editors take the level of signifi-cance as other than a "measure."

But taken as a "measure," it is only ameasure of the sample. Psychologists oftenerroneously believe that the p value is "theprobability that the results are due to chance,"as Wilson (1961, p. 230) has pointed out;that a p value of .05 means that the chancesare .95 that the scientific hypothesis is cor-rect, as Bolles (1962) has pointed out; thatit is a measure of the power to "predict" thebehavior of a population (Underwood et al.,1954, p. 107); and that it is a measure ofthe "confidence that the results of the experi-ment would be repeatable under the condi-tions described," as Melton put it. Unfortu-nately, none of these interpretations arewithin the inference model of the test of sig-nificance. Some of our statistical handbookshave "allowed" misinterpretation, For ex-ample, in discussing the erroneous rhetoricassociated with talking of the "probability"of a population parameter (in the inferencemodel there is no probability associated withsomething which is either true or false),Lindquist (1940) said, "For most practicalpurposes, the end result is the same as if the'level of confidence' type of interpretation isemployed [p. 14]." Ferguson (1959) wrote,"The .05 and .01 probability levels are de-scriptive of our degree of confidence [p.133]." There is little question but thatsizable differences, correlations, etc., insamples, especially samples of reasonable size,speak more strongly of sizable differences,correlations, etc., in the population; and there

is little question but that if there is real andstrong effect in the population, it will con-tinue to manifest itself in further sampling.However, these are inferences which we maymake. They are outside the inference modelassociated with the test of significance. Thep value within the inference model is onlythe value which we take to be as how im-probable an event could be under the nullhypothesis, which we judge will not takeplace to "us," in this one experiment. It isnot a "measure" of the goodness of the otherinferences which we might make. It is an apriori condition that we set up whereby wedecide whether or not we will reject the nullhypothesis, not a measure of significance.

There is a study in the literature (Rosenthal & Gaito, 1963) which points up sharply the lack of understanding on the part of psychologists of the meaning of the test of significance. The subjects were 9 members of the psychology department faculty, all holding doctoral degrees, and 10 graduate students, at the University of North Dakota; and there is little reason to believe that this group of psychologists was more or less sophisticated than any other. They were asked to rate their degree of belief or confidence in results of hypothetical studies for a variety of p values, and for ns of 10 and 100. That there should be a relationship between the average rated confidence or belief and p value, as they found, is to be expected. What is shocking is that these psychologists indicated substantially greater confidence or belief in results associated with the larger sample size for the same p values! According to the theory, especially as this has been amplified by Neyman and Pearson (1933), the probability of rejecting the null hypothesis for any given deviation from null and p value increases as a function of the number of observations. The rejection of the null hypothesis when the number of cases is small speaks for a more dramatic effect in the population; and if the p value is the same, the probability of committing a Type I error remains the same. Thus one can be more confident with a small n than a large n. The question is, how could a group of psychologists be so wrong? I believe that this wrongness is based on the commonly held belief that the p value is a "measure" of degree of confidence. Thus, the reasoning behind such a wrong set of answers by these psychologists may well have been something like this: the p value is a measure of confidence; but a larger number of cases also increases confidence; therefore, for any given p value, the degree of confidence should be higher for the larger n. The wrong conclusion arises from the erroneous character of the first premise, and from the failure to recognize that the p value is a function of sample size for any given deviation from null in the population. The author knows of instances in which editors of very reputable psychological journals have rejected papers in which the p values and ns were small on the grounds that there were not enough observations, clearly demonstrating that the same mode of thought is operating in them. Indeed, rejecting the null hypothesis with a small n is indicative of a strong deviation from null in the population, the mathematics of the test of significance having already taken into account the smallness of the sample. Increasing the n increases the probability of rejecting the null hypothesis; and in these studies rejected for small sample size, that task has already been accomplished. These editors are, of course, in some sense the ultimate "teachers" of the profession; and they have been teaching something which is patently wrong!

2. Automaticity of inference. What may be considered to be a dream, fantasy, or ideal in the culture of psychology is that of achieving complete automaticity of inference. The making of inductive generalizations is always somewhat risky. In Fisher's The Design of Experiments (1947, p. 4), he made the claim that the methods of induction could be made rigorous, exemplified by the procedures which he was setting forth. This is indeed quite correct in the sense indicated earlier. In a later paper, he made explicit what was strongly hinted at in his earlier writing, that the methods which he proposed constituted a relatively complete specification of the process of induction:

That such a process of induction existed and was possible to normal minds, has been understood for centuries; it is only with the recent development of statistical science that an analytic account can now be given, about as satisfying and complete, at least, as that given traditionally of the deductive processes [Fisher, 1955, p. 74].

Psychologists certainly took the procedures associated with the t test, F test, and so on, in this manner. Instead of having to engage in inference themselves, they had but to "run the tests" for the purpose of making inferences, since, as it appeared, the statistical tests were analytic analogues of inductive inference. The "operationist" orientation among psychologists, which recognized the contingency of knowledge on the knowledge-getting operations and advocated their specification, could, it would seem, "operationalize" the inferential processes simply by reporting the details of the statistical analysis! It thus removed the burden of responsibility, the chance of being wrong, the necessity for making inductive inferences, from the shoulders of the investigator and placed them on the tests of significance. The contingency of the conclusion upon the experimenter's decision of the level of significance was managed in two ways. The first, by resting on a kind of social agreement that 5% was good, and 1% better. The second, in the manner which has already been discussed, by not making a decision of the level of significance, but only reporting the p value as a "result" and a presumably objective "measure" of degree of confidence. But that the probability of getting significance is also contingent upon the number of observations has been handled largely by ignoring it.

A crisis was experienced among psychologists when the matter of the one- versus the two-tailed test came into prominence; for here the contingency of the result of a test of significance on a decision of the investigator was simply too conspicuous to be ignored. An investigator, say, was interested in the difference between two groups on some measure. He collected his data, found that Mean A was greater than Mean B in the sample, and ran the ordinary two-tailed t test; and, let us say, it was not significant. Then he bethought himself. The two-tailed test tested against two alternatives, that the population Mean A was greater than population Mean B and vice versa. But then, he really wanted to know whether Mean A was greater than Mean B. Thus, he could run a one-tailed test. He did this and found, since the one-tailed test is more powerful, that his difference was now significant.

Now here there was a difficulty. The test of significance is not nearly so automatic an inference process as had been thought. It is manifestly contingent on the decision of the investigator as to whether to run a one- or a two-tailed test. And somehow, making the decision after the data were collected and the means computed, seemed like "cheating." How should this be handled? Should there be some central registry in which one registers one's decision to run a one- or two-tailed test before collecting the data? Should one, as one eminent psychologist once suggested to me, send oneself a letter so that the postmark would prove that one had pre-decided to run a one-tailed test? The literature on ways of handling this difficulty has grown quite a bit in the strain to somehow overcome this particular clear contingency of the results of a test of significance on the decision of the investigator. The author will not attempt here to review this literature, except to cite one very competent paper which points up the intrinsic difficulty associated with this problem, the reductio ad absurdum to which one comes. Kaiser (1960), early in his paper, distinguished between the logic associated with the test of significance and other forms of inference, a distinction which, incidentally, Fisher would hardly have allowed: "The arguments developed in this paper are based on logical considerations in statistical inference. (We do not, of course, suggest that statistical inference is the only basis for scientific inference) [p. 160]." But then, having taken the position that he is going to follow the logic of statistical inference relentlessly, he said (Kaiser's italics): "we cannot logically make a directional statistical decision or statement when the null hypothesis is rejected on the basis of the direction of the difference in the observed sample means [p. 161]." One really needs to strike oneself in the head! If Sample Mean A is greater than Sample Mean B, and there is reason to reject the null hypothesis, in what other direction can it reasonably be? What kind of logic is it that leads one to believe that it could be otherwise than that Population Mean A is greater than Population Mean B? We do not know whether Kaiser intended his paper as a reductio ad absurdum, but it certainly turned out that way.

The issue of the one- versus the two-tailed test genuinely challenges the presumptive "objectivity" characteristically attributed to the test of significance. On the one hand, it makes patent what was the case under any circumstances (at the least in the choice of level of significance, and the choice of the number of cases in the sample), that the conclusion is contingent upon the decision of the investigator. An astute investigator, who foresaw the results, and who therefore pre-decided to use a one-tailed test, will get one p value. The less astute but honorable investigator, who did not foresee the results, would feel obliged to use a two-tailed test, and would get another p value. On the other hand, if one decides to be relentlessly logical within the logic of statistical inference, one winds up with the kind of absurdity which we have cited above.
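The arithmetic behind the investigator's maneuver is simple, and can be sketched with a normal approximation (the test statistic of 1.8 is invented for illustration): for a positive statistic, the one-tailed p is exactly half the two-tailed p, so a result just short of .05 two-tailed becomes "significant" one-tailed.

```python
import math

def p_two_tailed(z):
    """Two-tailed p value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def p_one_tailed(z):
    """Upper-tail p value: half the two-tailed p for a positive z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = 1.8  # a made-up statistic near the conventional boundary
print(p_two_tailed(z))  # just above .05: "not significant"
print(p_one_tailed(z))  # below .05: "significant"
```

The same data and the same statistic thus yield two different "conclusions," the difference resting entirely on the investigator's decision.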

3. The confusion of induction to the aggregate with induction to the general. Consider a not atypical investigation of the following sort: A group of, say, 20 normals and a group of, say, 20 schizophrenics are given a test. The tests are scored, and a t test is run, and it is found that the means differ significantly at some level of significance, say 1%. What inference can be drawn? As we have already indicated, the investigator could have insured this result by choosing a sufficiently large number of cases. Suppose we overlook this objection, which we can to some extent, by saying that the difference between the means in the population must have been large enough to have manifested itself with only 40 cases. But still, what do we know from this? The only inference which this allows is that the mean of all normals is different from the mean of all schizophrenics in the populations from which the samples have presumably been drawn at random. (Rarely is the criterion of randomness satisfied. But let us overlook this objection too.)

The common rhetoric in which such results are discussed is in the form "Schizophrenics differ from normals in such and such ways." The sense that both the reader and the writer have of this rhetoric is that it has been justified by the finding of significance. Yet clearly it does not mean all schizophrenics and all normals. All that the test of significance justifies is that measures of central tendency of the aggregates differ in the populations. The test of significance has not addressed itself to anything about the schizophrenia or normality which characterizes each member of the respective populations. Now it is certainly possible for an investigator to develop a hypothesis about the nature of schizophrenia from which he may infer that there should be differences between the means in the populations; and his finding of a significant difference in the means of his sample would add to the credibility of the former. However, that 1% which he obtained in his study bears only on the means of the populations, and is not a "measure" of the confidence that he may have in his hypothesis concerning the nature of schizophrenia. There are two inferences that he must make. One is that of the sample to the population, for which the test of significance is of some use. The other is from his inference concerning the population to his hypothesis concerning the nature of schizophrenia. The p value does not bear on this second inference. The psychological literature is filled with assertions which confound these two inferential processes.

Or consider another hardly atypical style of research. Say an experimenter divides 40 subjects at random into two groups of 20 subjects each. One group is assigned to one condition and the other to another condition, perhaps, say, massing and distribution of trials. The subjects are given a learning task, one group under massed conditions, the other under distributed conditions. The experimenter runs a t test on the learning measure and again, say, finds that the difference is significant at the 1% level of significance. He may then say in his report, being more careful than the psychologist who was studying the difference between normals and schizophrenics (being more "scientific" than his clinically-interested colleague), that "the mean in the population of learning under massed conditions is lower than the mean in the population of learning under distributed conditions," feeling that he can say this with a good deal of certainty because of his test of significance. But here too (like his clinical colleague) he has made two inferences, and not one, and the 1% bears on the one but not the other. The statistical inference model certainly allows him to make his statement for the population, but only for that learning task, and the p value is appropriate only to that. But the generalization to "massed conditions" and "distributed conditions" beyond that particular learning task is a second inference with respect to which the p value is not relevant. The psychological literature is plagued with any number of instances in which the rhetoric indicates that the p value does bear on this second inference.

Part of the blame for this confusion can be ascribed to Fisher who, in The Design of Experiments (1947, p. 9), suggested that the mathematical methods which he proposed were exhaustive of scientific induction, and that the principles he was advancing were "common to all experimentation." What he failed to see and to say was that after an inference was made concerning a population parameter, one still needed to engage in induction to obtain meaningful scientific propositions.

To regard the methods of statistical inference as exhaustive of the inductive inferences called for in experimentation is completely confounding. When the test of significance has been run, the necessity for induction has hardly been completely satisfied. However, the research worker knows this, in some sense, and proceeds, as he should, to make further inductive inferences. He is, however, still ensnarled in his test of significance and the presumption that it is the whole of his inductive activity, and thus mistakenly takes a low p value for the measure of the validity of his other inductions.

The seriousness of this confusion may be seen by again referring back to the Rosenthal and Gaito (1963) study and the remark by Berkson which indicate that research workers believe that a large sample is better than a small sample. We need to refine the rhetoric somewhat. Induction consists in making inferences from the particular to the general. It is certainly the case that as confirming particulars are added, the credibility of the general is increased. However, the addition of observations to a sample is, in the context of statistical inference, not the addition of particulars but the modification of what is one particular in the inference model, the sample aggregate. In the context of statistical inference, it is not necessarily true that "a large sample is better than a small sample." For, as has already been indicated, obtaining a significant result with a small sample suggests a larger deviation from null in the population, and may be considerably more meaningful. Thus more particulars are better than fewer particulars in the making of an inductive inference; but not necessarily a larger sample.
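The point that a significant result with a small sample implies a larger deviation from null can be made concrete. In the sketch below (invented numbers; a two-sided z test at the .05 level stands in for the t test), the smallest mean difference that just reaches significance shrinks as n grows, so the same p value testifies to a much larger estimated effect when n is small:

```python
import math

Z_CRIT = 1.959964  # two-sided .05 critical value of the standard normal

def just_significant_difference(sd, n):
    """Smallest mean difference that just reaches p = .05 with a z test."""
    return Z_CRIT * sd / math.sqrt(n)

# The same p value corresponds to a far larger effect when n is small:
for n in (10, 100, 1000):
    print(n, just_significant_difference(1.0, n))
```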

In the marriage of psychological research and statistical inference, psychology brought its own reasons for accepting this confusion, reasons which inhere in the history of psychology. Measurement psychology arises out of two radically different traditions, as has been pointed out by Guilford (1936, pp. 5 ff.) and Cronbach (1957), and the matter of putting them together raised certain difficulties. The one tradition seeks to find propositions concerning the nature of man in general—propositions of a general nature, with each individual a particular in which the general is manifest. This is the kind of psychology associated with the traditional experimental psychology of Fechner, Ebbinghaus, Wundt, and Titchener. It seeks to find the laws which characterize the "generalized, normal, human, adult mind [Boring, 1950, p. 413]." The research strategy associated with this kind of psychology is straightforwardly inductive. It seeks inductive generalizations which will apply to every member of a designated class. A single particular in which a generalization fails forces a rejection of the generalization, calling for either a redefinition of the class to which it applies or a modification of the generalization. The other tradition is the psychology of individual differences, which has its roots more in England and the United States than on the Continent. We may recall that when the young American, James McKeen Cattell, who invented the term mental test, came to Wundt with his own problem of individual differences, it was regarded by Wundt as ganz Amerikanisch (Boring, 1950, p. 324).

The basic datum for an individual-differences approach is not anything that characterizes each of two subjects, but the difference between them. For this latter tradition, it is the aggregate which is of interest, and not the general. One of the most unfortunate characteristics of many studies in psychology, especially in experimental psychology, is that the data are treated as aggregates while the experimenter is trying to infer general propositions. There is hardly an issue of most of the major psychological journals reporting experimentation in which this confusion does not appear several times; and in which the test of significance, which has some value in connection with the study of aggregates, is not interpreted as a measure of the credibility of the general proposition in which the investigator is interested.

The distinction between the aggregate and the general may be illuminated by a small mathematical exercise. The methods of analysis of variance developed by Fisher and his school have become techniques of choice among psychologists. However, at root, the methods of analysis of variance do not deal with that which any two or more subjects may have in common, but consider only differences between scores. This is all that is analyzed by analysis of variance. The following identity illustrates this clearly, showing that the original total sum of squares, of which everything else in any analysis of variance is simply a partitioning, is based on the literal difference between each pair of scores (cf. Bakan, 1955). Except for n, it is the only information used from the data:

\[
\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 = \frac{1}{n} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( X_i - X_j \right)^2
\]
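The identity is easy to verify numerically. The following fragment (invented scores, chosen only for illustration) computes the total sum of squares both ways and shows they agree:

```python
from itertools import combinations

def total_ss(xs):
    """Total sum of squares about the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def pairwise_form(xs):
    """The same quantity computed from pairwise differences alone."""
    n = len(xs)
    return sum((a - b) ** 2 for a, b in combinations(xs, 2)) / n

xs = [3.0, 7.0, 8.0, 12.0, 15.0]
print(total_ss(xs), pairwise_form(xs))  # the two agree
```

Nothing about what the scores have in common enters the second form; apart from n, only the differences between pairs of scores are used, which is Bakan's point.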

Thus, what took place historically in psychology is that instead of attempting to synthesize the two traditional approaches to psychological phenomena, which is both possible and desirable, a syncretic combination took place of the methods appropriate to the study of aggregates with the aims of a psychology which sought for general propositions. One of the most overworked terms, which added not a little to the essential confusion, was the term "error," which was a kind of umbrella term for (at the least) variation among scores from different individuals, variation among measurements for the same individual, and variation among samples.

Let us add another historical note. In 1936, Guilford published his well-known Psychometric Methods. In this book, which became a kind of "bible" for many psychologists, he made a noble effort at a "Rapprochement of Psychophysical and Test Methods" (p. 9). He observed, quite properly, that mathematical developments in each of the two fields might be of value in the other, that "Both psychophysics and mental testing have rested upon the same fundamental statistical devices [p. 9]." There is no question of the truth of this. However, what he failed to emphasize sufficiently was that mathematics is so abstract that the same mathematics is applicable to rather different fields of investigation without there being any necessary further identity between them. (One would not, for example, argue that business and genetics are essentially the same because the same arithmetic is applicable to market research and in the investigation of the facts of heredity.) A critical point of contact between the two traditions was in connection with scaling, in which Cattell's principle that "equally often noticed differences are equal unless always or never noticed [Guilford, 1936, p. 217]" was adopted as a fundamental assumption. The "equally often noticed differences" is, of course, based on aggregates. By means of this assumption, one could collapse the distinction between the two areas of investigation. Indeed, this is not really too bad if one is alert to the fact that it is an assumption, one which even has considerable pragmatic value. As a set of techniques whereby data could be analyzed, that is, as a set of techniques whereby one could describe one's findings, and then make inductions about the nature of the psychological phenomena, that which Guilford put together in his book was eminently valuable. However, around this time the work of Fisher and his school was coming to the attention of psychologists. It was attractive for several reasons. It offered advice for handling "small samples." It offered a number of eminently ingenious new ways of organizing and extracting information from data. It offered ways by which several variables could be analyzed simultaneously, away from the old notion that one had to keep everything constant and vary only one variable at a time. It showed how the effect of the "interaction" of variables could be assessed. But it also claimed to have mathematized induction! The Fisher approach was thus "bought," and psychologists got a theory of induction in the bargain, a theory which seemed to exhaust the inductive processes. Whereas the question of the "reliability" of statistics had been a matter of concern for some time before (although frequently very garbled), it had not carried the burden of induction to the degree that it did with the Fisher approach. With the "buying" of the Fisher approach, the psychological research worker also bought, and then overused, the test of significance, employing it as the measure of the significance, in the largest sense of the word, of his research efforts.

SHARP AND LOOSE NULL HYPOTHESES

Earlier, a distinction was made between sharp and loose null hypotheses. One of the major difficulties associated with the Fisher approach is the problem presented by sharp null hypotheses; for, as we have already seen, there is reason to believe that the existence of sharp null hypotheses is characteristically unlikely. There have been some efforts to correct for this difficulty by proposing the use of loose null hypotheses; in place of a single point, a region being considered null. Hodges and Lehmann (1954) have proposed a distinction between "statistical significance," which entails the sharp hypothesis, and "material significance," in which one tests the hypothesis of a deviation of a stated amount from the null point instead of the null point itself. Edwards (1950, pp. 30-31) has suggested the notion of "practical significance" in which one takes into account the meaning, in some practical sense, of the magnitude of the deviation from null together with the number of observations which have been involved in getting statistical significance. Binder (1963) has similarly argued that a subset of parameters be equated with the null hypothesis. Essentially what has been suggested is that the investigator make some kind of a decision concerning "How much, say, of a difference makes a difference?" The difficulty with this solution, which is certainly a sound one technically, is that in psychological research we do not often have very good grounds for answering this question. This is partly due to the inadequacies of psychological measurement, but mostly due to the fact that the answer to the question of "How much of a difference makes a difference?" is not forthcoming outside of some particular practical context. The question calls forth another question, "How much of a difference makes a difference for what?"
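One way of reading the Hodges and Lehmann proposal can be sketched computationally (this is an illustrative formalization, not their notation; the numbers, the choice of a one-sided z test, and the "materially trivial" region delta are all invented): instead of testing the sharp null of zero difference, one tests against a deviation of a stated amount.

```python
import math

def material_significance_p(mean_diff, sd, n, delta):
    """One-sided p value for the hypothesis that the population
    difference is no more than `delta` (a 'loose' null region),
    rather than exactly zero (the sharp null)."""
    z = (mean_diff - delta) / (sd / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))

# A difference wildly 'significant' against the sharp null may be
# quite unimpressive once a materially trivial region is allowed:
print(material_significance_p(0.30, 1.0, 400, 0.0))   # sharp null: tiny p
print(material_significance_p(0.30, 1.0, 400, 0.25))  # loose null: p near .16
```

The sketch presupposes exactly what Bakan notes is usually missing: a defensible answer to "how much of a difference makes a difference," here smuggled in as delta.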

DECISIONS VERSUS ASSERTIONS

This brings us to one of the major issues within the field of statistics itself. The problems of the research psychologist do not generally lie within practical contexts. He is rather interested in making assertions concerning psychological functions which have a reasonable amount of credibility associated with them. He is more concerned with "What is the case?" than with "What is wise to do?" (cf. Rozeboom, 1960).

It is here that the decision-theory approach of Neyman, Pearson, and Wald (Neyman, 1937, 1957; Neyman & Pearson, 1933; Wald, 1939, 1950, 1955) becomes relevant. The decision-theory school, still basing itself on some basic notions of the Fisher approach, deviated from it in several respects:

1. In Fisher's inference model, the two alternatives between which one chose on the basis of an experiment were reject and inconclusive. As he said in The Design of Experiments (1947), "the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation [p. 16]." In the decision-theory approach, the two alternatives are rather reject and accept.

2. Whereas in the Fisher approach the interpretation of the test of significance critically depends on having one sample from a hypothetical population of experiments, the decision-theory approach conceives of, is applicable to, and is sensible with respect to numerous repetitions of the experiment.

3. The decision-theory approach added the notions of the Type II error (which can be made only if the null hypothesis is accepted) and power as significant features of their model.

4. The decision-theory model gave a significant place to the matter of what is concretely lost if an error is made in the practical context, on the presumption that accept entailed one concrete action, and reject another. It is in these actions and their consequences that there is a basis for deciding on a level of confidence. The Fisher approach has little to say about the consequences.
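The power notion of point 3 can be sketched numerically (invented effect size, sd, and ns; a two-sided z test at the .05 level stands in for the t test): power is the probability of rejecting the null when a given deviation from null actually holds, and 1 minus the power is the Type II error rate.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def power(mean_diff, sd, n, z_crit=1.959964):
    """Probability of rejecting a sharp null with a two-sided z test
    at the .05 level when the true difference is `mean_diff`;
    the Type II error rate is 1 minus this."""
    shift = mean_diff / (sd / math.sqrt(n))
    # probability the statistic lands beyond either critical bound
    return (1 - phi(z_crit - shift)) + phi(-z_crit - shift)

print(power(0.5, 1.0, 10))  # roughly a one-in-three chance of rejecting
print(power(0.5, 1.0, 40))  # close to nine in ten
```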

As it has turned out, the field of application par excellence for the decision-theory approach has been the sampling inspection of mass-produced items. In sampling inspection, the acceptable deviation from null can be specified; both accept and reject are appropriate categories; the alternative courses of action can be clearly specified; there is a definite measure of loss for each possible action; and the choice can be regarded as one of a series of such choices, so that one can minimize the overall loss (cf. Barnard, 1954). Where the aim is only the acquisition of knowledge without regard to a specific practical context, these conditions do not often prevail. Many psychologists who learned about analysis of variance from books such as those by Snedecor (1946) found the examples involving hog weights, etc., somewhat annoying. The decision-theory school makes it clear that such practical contexts are not only "examples" given for pedagogical purposes, but actually are essential features of the methods themselves.
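The structure of such a sampling-inspection decision can be sketched in miniature (the loss figures and defect rates below are invented for illustration): each action carries a definite loss, and one simply chooses the action that minimizes the expected loss.

```python
def expected_loss(action, p_defective,
                  loss_accept_bad=10.0, loss_reject_good=1.0):
    """Expected loss of an action given a believed defect rate.
    The loss figures are arbitrary illustrative stakes."""
    if action == "accept":
        return p_defective * loss_accept_bad       # cost of shipping bad lots
    return (1 - p_defective) * loss_reject_good    # cost of scrapping good lots

def decide(p_defective):
    """Choose the action that minimizes expected loss."""
    return min(("accept", "reject"),
               key=lambda a: expected_loss(a, p_defective))

print(decide(0.02))  # accept: the lot is probably fine
print(decide(0.30))  # reject: shipping it is too costly in expectation
```

It is precisely these concrete losses that the research psychologist, who wants to know "what is the case," typically cannot supply.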

The contributions of the decision-theory school essentially revealed the intrinsic nature of the test of significance beyond that seen by Fisher and his colleagues. They demonstrated that the methods associated with the test of significance constitute not an assertion, or an induction, or a conclusion calculus, but a decision- or risk-evaluation calculus. Fisher (1955) has reacted to the decision-theory approach in polemic style, suggesting that its advocates were like "Russians [who] are made familiar with the ideal that research in pure science can and should be geared to technological performance, in the comprehensive organized effort of a five-year plan for the nation." He also suggested an American "ideological" orientation: "In the U. S. also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions, with those aimed rather at, let us say, speeding production, or saving money [p. 70]."⁶ But perhaps a more reasonable way of looking at this is to regard the decision-theory school to have explicated what was already implicit in the work of the Fisher school.

CONCLUSION

What then is our alternative, if the test of significance is really of such limited appropriateness as has been indicated? At the very least it would appear that we would be much better off if we were to attempt to estimate the magnitude of the parameters in the populations; and recognize that we then need to make other inferences concerning the psychological phenomena which may be manifesting themselves in these magnitudes. In terms of a statistical approach which is an alternative, the various methods associated with the theorem of Bayes which was referred to earlier may be appropriate; and the paper by Edwards et al. (1963) and the book by Schlaifer (1959) are good starting points. However, that which is expressed in the theorem of Bayes alludes to the more general process of inducing propositions concerning the non-manifest (of which the population is a special instance) and ascertaining the way in which that which is manifest (of which the sample is a special instance) bears on it. This is what the scientific method has been about for centuries. However, if the reader who might be sympathetic to the considerations set forth in this paper quickly goes out and reads some of the material on the Bayesian approach with the hope that thereby he will find a new basis for automatic inference, this paper will have misfired, and he will be disappointed.

6 For a reply to Fisher, see Pearson (1955).
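The Bayesian updating alluded to here, inducing a proposition about the non-manifest population from the manifest sample, can be sketched with a conjugate prior. The Beta-binomial model and all the numbers are illustrative assumptions introduced here; they are not drawn from Edwards et al. (1963) or Schlaifer (1959).

```python
# Illustrative sketch of Bayes' theorem applied to estimation: a
# non-manifest population proportion is inferred from a manifest sample
# using a conjugate Beta prior. The prior and the data are invented.

def beta_posterior(prior_a, prior_b, successes, failures):
    """Update a Beta(a, b) prior with binomial data; returns (a, b)."""
    return prior_a + successes, prior_b + failures

# Uniform prior Beta(1, 1); observe 7 successes in 10 trials.
a, b = beta_posterior(1, 1, successes=7, failures=3)
posterior_mean = a / (a + b)   # point estimate of the population proportion

print(a, b)                    # 8 4
print(round(posterior_mean, 3))  # 0.667
```

Note that the output is an estimated magnitude with a full posterior distribution behind it, in line with the paper's recommendation, rather than a binary reject/accept verdict; and that choosing the prior is itself a judgment no formula automates.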

That which we have indicated in this paper in connection with the test of significance in psychological research may be taken as an instance of a kind of essential mindlessness in the conduct of research which may be, as the author has suggested elsewhere (Bakan, 1965), related to the presumption of the nonexistence of mind in the subjects of psychological research. Karl Pearson once indicated that higher statistics were only common sense reduced to numerical appreciation. However, that base in common sense must be maintained with vigilance. When we reach a point where our statistical procedures are substitutes for, instead of aids to, thought, and we are led to absurdities, then we must return to the common sense basis. Tukey (1962) has very properly pointed out that statistical procedures may take our attention away from the data, which constitute the ultimate base for any inferences which we might make. Robert Schlaifer (1959, p. 654) has dubbed the error of the misapplication of statistical procedures the "error of the third kind," the most serious error which can be made. Berkson has suggested the use of "the interocular traumatic test, you know what the data mean when the conclusion hits you between the eyes [Edwards et al., 1963, p. 217]." We must overcome the myth that if our treatment of our subject matter is mathematical it is therefore precise and valid. Mathematics can serve to obscure as well as reveal.

Most importantly, we need to get on with the business of generating psychological hypotheses and proceed to do investigations and make inferences which bear on them; instead of, as so much of our literature would attest, testing the statistical null hypothesis in any number of contexts in which we have every reason to suppose that it is false in the first place.

REFERENCES

BAKAN, D. Learning and the principle of inverse probability. Psychological Review, 1953, 60, 360-370.

BAKAN, D. The general and the aggregate: A methodological distinction. Perceptual and Motor Skills, 1955, 5, 211-212.

BAKAN, D. Clinical psychology and logic. American Psychologist, 1956, 11, 655-662.

BAKAN, D. The mystery-mastery complex in contemporary psychology. American Psychologist, 1965, 20, 186-191.

BARNARD, G. A. Sampling inspection and statistical decisions. Journal of the Royal Statistical Society (B), 1954, 16, 151-165.

BERKSON, J. Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 1938, 33, 526-542.

BERKSON, J. Tests of significance considered as evidence. Journal of the American Statistical Association, 1942, 37, 325-335.

BINDER, A. Further considerations on testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 1963, 70, 101-109.

BOLLES, R. C. The difference between statistical hypotheses and scientific hypotheses. Psychological Reports, 1962, 11, 639-645.

BORING, E. G. A history of experimental psychology. (2nd ed.) New York: Appleton-Century-Crofts, 1950.

COHEN, J. The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 1962, 65, 145-153.

CRONBACH, L. J. The two disciplines of scientific psychology. American Psychologist, 1957, 12, 671-684.

EDWARDS, A. L. Experimental design in psychological research. New York: Rinehart, 1950.

EDWARDS, W., LINDMAN, H., & SAVAGE, L. J. Bayesian statistical inference for psychological research. Psychological Review, 1963, 70, 193-242.

FERGUSON, L. Statistical analysis in psychology and education. New York: McGraw-Hill, 1959.

FISHER, R. A. The design of experiments. (4th ed.) Edinburgh: Oliver & Boyd, 1947.

FISHER, R. A. Statistical methods and scientific induction. Journal of the Royal Statistical Society (B), 1955, 17, 69-78.

GUILFORD, J. P. Psychometric methods. New York: McGraw-Hill, 1936.

HODGES, J. L., & LEHMANN, E. L. Testing the approximate validity of statistical hypotheses. Journal of the Royal Statistical Society (B), 1954, 16, 261-268.

HOGBEN, L. The relationship of probability, credibility and error: An examination of the contemporary crisis in statistical theory from a behaviourist viewpoint. New York: Norton, 1958.

KAISER, H. F. Directional statistical decision. Psychological Review, 1960, 67, 160-167.

KEYNES, J. M. A treatise on probability. London: Macmillan, 1948.

LACEY, O. L. Statistical methods in experimentation. New York: Macmillan, 1953.

LINDQUIST, E. F. Statistical analysis in educational research. Boston: Houghton Mifflin, 1940.

MCNEMAR, Q. At random: Sense and nonsense. American Psychologist, 1960, 15, 295-300.

MELTON, A. W. Editorial. Journal of Experimental Psychology, 1962, 64, 553-557.

NEYMAN, J. Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society (A), 1937, 236, 333-380.

NEYMAN, J. "Inductive behavior" as a basic concept of philosophy of science. Review of the International Statistical Institute, 1957, 25, 7-22.

NEYMAN, J., & PEARSON, E. S. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society (A), 1933, 231, 289-337.

NUNNALLY, J. The place of statistics in psychology. Educational and Psychological Measurement, 1960, 20, 641-650.

PEARSON, E. S. Statistical concepts in their relation to reality. Journal of the Royal Statistical Society (B), 1955, 17, 204-207.

ROSENTHAL, R., & GAITO, J. The interpretation of levels of significance by psychological researchers. Journal of Psychology, 1963, 55, 33-38.

ROZEBOOM, W. W. The fallacy of the null-hypothesis significance test. Psychological Bulletin, 1960, 57, 416-428.

SAVAGE, L. J. The foundations of statistics. New York: Wiley, 1954.

SCHLAIFER, R. Probability and statistics for business decisions. New York: McGraw-Hill, 1959.

SMITH, C. A. B. Review of N. T. J. Bailey, Statistical methods in biology. Applied Statistics, 1960, 9, 64-66.

SNEDECOR, G. W. Statistical methods. (4th ed.; orig. publ. 1937) Ames, Iowa: Iowa State College Press, 1946.

TUKEY, J. W. The future of data analysis. Annals of Mathematical Statistics, 1962, 33, 1-67.

UNDERWOOD, B. J., DUNCAN, C. P., TAYLOR, J. A., & COTTON, J. W. Elementary statistics. New York: Appleton-Century-Crofts, 1954.

WALD, A. Contributions to the theory of statistical estimation and testing hypotheses. Annals of Mathematical Statistics, 1939, 10, 299-326.

WALD, A. Statistical decision functions. New York: Wiley, 1950.

WALD, A. Selected papers in statistics and probability. New York: McGraw-Hill, 1955.

WILSON, K. V. Subjectivist statistics for the current crisis. Contemporary Psychology, 1961, 6, 229-231.

(Received June 30, 1965)