GRAPHICAL METHODS FOR INVESTIGATING THE SIZE AND POWER OF HYPOTHESIS TESTS*

by
RUSSELL DAVIDSON
Queen's University, Ontario, and GREQAM, Marseille
and
JAMES G. MacKINNON†
Queen's University, Ontario

Simple techniques for the graphical display of simulation evidence concerning the size and power of hypothesis tests are developed and illustrated. Three types of figures, called P value plots, P value discrepancy plots and size–power curves, are discussed. Some Monte Carlo experiments on the properties of alternative forms of the information matrix test for linear regression models and probit models are used to illustrate these figures. Tests based on the outer-product-of-the-gradient (OPG) regression generally perform much worse in terms of both size and power than efficient score tests.

1 Introduction

To obtain evidence on the finite-sample properties of hypothesis testing procedures, econometricians generally resort to simulation methods. As a result many, if not most, of the papers that deal with specification testing and other forms of hypothesis tests include some Monte Carlo results. It is often a challenge to present these results in a form that is reasonably compact and easy to comprehend. In this paper we discuss some simple graphical methods that can be very useful for presenting simulation results on the size and the power of test statistics. The graphs convey much more information, in a more easily assimilated form, than tables can do.

Consider a Monte Carlo experiment in which N realizations of some test statistic t are generated using a data-generating process (DGP) that is a special case of the null hypothesis. We may denote these simulated values by t_j, j = 1, ..., N. Unless t is extraordinarily expensive to compute (as it may be if bootstrapping is involved; see Section 3), N will generally be a large number, probably 5000 or more. In practice, of course, several different test statistics may be generated on each replication.

© Blackwell Publishers Ltd and The Victoria University of Manchester, 1998. Published by Blackwell Publishers Ltd, 108 Cowley Road, Oxford OX4 1JF, UK, and 350 Main Street, Malden, MA 02148, USA.


The Manchester School Vol 66 No. 1 January 1998, 0025–2034, 1–26

* Manuscript received 14.3.97; final version received 1.5.97.
† This research was supported, in part, by the Social Sciences and Humanities Research Council of Canada. We are grateful to Chris Orme, one referee and participants in workshops at Cornell University, York University, Queen's University, the State University of New York at Albany, GREQAM, the Statistical Society of Canada and the Canadian Econometric Study Group for helpful comments on earlier versions.


The conventional way to report the results of such an experiment is to tabulate the proportion of the time that t_j exceeds one or more critical values, such as the 1, 5 and 10 per cent values for the asymptotic distribution of t. This approach has at least two serious disadvantages. First, the tables provide information about only a few points on the finite-sample distribution of t; as we shall see in Section 4, this can be an important limitation. Second, the tables require some effort to interpret, and they generally do not make it easy to see how changes in the sample size, the number of degrees of freedom and other factors affect test size. In this paper, we advocate graphical methods that provide more information, are easy to implement and yield graphs that are easy to interpret.

The plan of the paper is as follows. In the next section, we discuss the graphs that we are proposing for experiments concerning test size. Then, in Section 3, we briefly discuss several forms of the information matrix (IM) test statistic, which was proposed originally by White (1982), for the normal linear regression model and the probit model. In Section 4, we present a number of Monte Carlo results on various forms of the IM test to illustrate the use of P value plots and P value discrepancy plots. In Section 5, we discuss and illustrate methods for smoothing P value discrepancy plots. Finally, in Section 6, we discuss and illustrate the use of size–power curves.

Some of the results on IM tests in Sections 4 and 6 are new and interesting. In particular, we find that bootstrapping IM tests in probit models dramatically improves their performance under the null. We also find that, for both linear regression models and probit models, the outer-product-of-the-gradient (OPG) form of the IM test generally has substantially lower power than other forms of the test, when all the forms are properly size-corrected.

2 P Value Plots and P Value Discrepancy Plots

All the graphs we discuss are based on the empirical distribution function (EDF) of the P values of the t_j. The P value of t_j is the probability of observing a value of t as extreme as or more extreme than t_j, according to some distribution F(t). This distribution could be the asymptotic distribution of t, or it could be a distribution derived by bootstrapping, or it could be an approximation to the (generally unknown) finite-sample distribution of t. For notational simplicity, we shall assume that there is only one P value associated with t_j, namely, p_j = p(t_j). Precisely how p_j is defined will vary. For example, if t is asymptotically distributed as χ²(r) and F_{χ²}(x; r) denotes the cumulative distribution function (c.d.f.) of the χ²(r) distribution evaluated at x, then p_j = 1 − F_{χ²}(t_j; r).
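As a concrete illustration, the upper-tail P value p_j = 1 − F_{χ²}(t_j; r) can be computed without a statistics library when r is even, using the closed-form χ² survival function (a Poisson tail identity). This sketch, including the restriction to even r and the function name, is ours rather than the paper's:

```python
import math

def chi2_sf(x, r):
    """Upper-tail probability 1 - F_chi2(x; r), valid for even r = 2m:
    exp(-x/2) * sum_{i=0}^{m-1} (x/2)**i / i!  (a Poisson tail identity)."""
    if r % 2 != 0:
        raise ValueError("this sketch handles even degrees of freedom only")
    term, total = 1.0, 0.0
    for i in range(r // 2):
        if i > 0:
            term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

# P value of a simulated statistic t_j against a chi2(4) reference distribution:
print(round(chi2_sf(9.488, 4), 3))   # 0.05, the familiar 5 per cent critical value
```

For odd r, or in practice, one would normally call a library routine such as scipy.stats.chi2.sf instead.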

The EDF of the p_j is simply an estimate of the c.d.f. of p(t). At any point x_i in the (0, 1) interval, it is defined by

\hat F(x_i) = \frac{1}{N} \sum_{j=1}^{N} I(p_j \le x_i) \qquad (1)

where I(p_j ≤ x_i) is an indicator function that takes the value 1 if its argument is true and 0 otherwise. The EDF (1) is often evaluated at every data point. However, this is unnecessary. When N is large, as it often will be, storage space can be conserved by evaluating the EDF only at m points x_i, i = 1, ..., m, which should be chosen in advance so as to provide a reasonable snapshot of the (0, 1) interval, or of that part of it which is of interest.
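A minimal sketch of equation (1), evaluating the EDF only at a chosen grid of points; sorting once and using binary search means the N simulated P values need not be rescanned for each x_i. The function name is ours:

```python
from bisect import bisect_right

def edf_at_points(p_values, grid):
    """Equation (1): F_hat(x_i) = (1/N) * #{j : p_j <= x_i}, evaluated
    only at the grid points x_i rather than at every data point."""
    p_sorted = sorted(p_values)
    n = len(p_sorted)
    return [bisect_right(p_sorted, x) / n for x in grid]

# toy example: four simulated P values, EDF evaluated at three points
print(edf_at_points([0.1, 0.2, 0.3, 0.8], [0.05, 0.25, 0.9]))   # [0.0, 0.5, 1.0]
```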

It is difficult to state categorically how large m should be and how the x_i should be chosen. A quite parsimonious way to choose the x_i is

x_i = 0.002, 0.004, \ldots, 0.01, 0.02, \ldots, 0.99, 0.992, \ldots, 0.998 \quad (m = 107) \qquad (2)

Another choice, which may give slightly better results, is

x_i = 0.001, 0.002, \ldots, 0.010, 0.015, \ldots, 0.990, 0.991, \ldots, 0.999 \quad (m = 215) \qquad (3)

For both (2) and (3), there are extra points near 0 and 1 in order to ensure that we do not miss any unusual behaviour in the tails. As we shall see in Section 6, it may be necessary to add additional points in certain cases. Note that, because we evaluate the EDF only at a relatively small number of points, it is straightforward to employ variance reduction techniques. If control or antithetic variates are available, the techniques proposed by Davidson and MacKinnon (1992b) to estimate tail areas can simply be applied to each point on the EDF.
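Built from integer steps to avoid accumulating floating-point error, the two grids in (2) and (3) can be generated as follows (function names are ours):

```python
def grid_107():
    """The parsimonious grid of equation (2), with extra points near 0 and 1."""
    xs = [i / 1000 for i in range(2, 11, 2)]        # 0.002, 0.004, ..., 0.010
    xs += [i / 100 for i in range(2, 100)]          # 0.02, 0.03, ..., 0.99
    xs += [i / 1000 for i in range(992, 999, 2)]    # 0.992, 0.994, 0.996, 0.998
    return xs

def grid_215():
    """The finer grid of equation (3)."""
    xs = [i / 1000 for i in range(1, 11)]           # 0.001, ..., 0.010
    xs += [i / 1000 for i in range(15, 991, 5)]     # 0.015, 0.020, ..., 0.990
    xs += [i / 1000 for i in range(991, 1000)]      # 0.991, ..., 0.999
    return xs

print(len(grid_107()), len(grid_215()))   # 107 215
```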

The simplest graph that we will discuss is a plot of F̂(x_i) against x_i. We shall refer to such a graph as a P value plot. If the distribution of t used to compute the p_j is correct, each of the p_j should be distributed as uniform (0, 1). Therefore, when F̂(x_i) is plotted against x_i, the resulting graph should be close to the 45° line. As we shall see in Section 4, P value plots allow us to distinguish at a glance among test statistics that systematically over-reject, test statistics that systematically under-reject, and test statistics that reject about the right proportion of the time. However, because P value plots for all test statistics that behave approximately the way they should will look roughly like 45° lines, these plots are not very useful for distinguishing between such test statistics.

For dealing with test statistics that are well behaved, it is much more revealing to plot F̂(x_i) − x_i against x_i. We shall refer to such a graph as a P value discrepancy plot. These plots have advantages and disadvantages. They convey a lot more information than P value plots for test statistics that are well behaved. However, some of this information is spurious, simply reflecting experimental randomness. In Section 5, we therefore discuss semi-parametric methods for smoothing them. Moreover, because there is no natural scale for the vertical axis, P value discrepancy plots can be harder to interpret than P value plots.

P value plots and P value discrepancy plots are very useful for dealing with test size, but not useful for dealing with test power. In Section 6, we discuss graphical methods for comparing the power of competing tests using size–power curves. These curves can be constructed using two EDFs, one for an experiment in which the null hypothesis is true, and one for an experiment in which it is false.

It would be more conventional to graph the EDF of the t_j instead of the EDF of their P values. If the former were graphed against a hypothesized distribution, the result would be what is often called a PP plot; see Wilk and Gnanadesikan (1968). Plotting P values makes it much easier to interpret the plots, since what they should look like will not depend on the null distribution of the test statistic. This fact also makes it easy to compare test statistics which have different distributions under the null, and to compare different procedures for making inferences from the same test statistics. Note that P value plots are really just PP plots of P values. Contrary to what we believed initially, we are not the first to employ them; see, for example, Fisher et al. (1994). There are also some recent applications in econometrics, which generally cite the discussion paper version of this paper; see, for example, Hansen et al. (1996) and West and Wilcox (1996).

Another common type of plot is a QQ plot, in which the empirical quantiles of the t_j are plotted against the actual quantiles of their hypothesized distribution. If the empirical distribution is close to the hypothesized one, the plot will be close to the 45° line. This approach, which has been used by Chesher and Spady (1991), among others, can yield useful information. If the plot is linear, then a suitable change of scale is all that is needed to make the test statistic perform as it should. This would suggest that a Bartlett-type correction may be appropriate (see Cribari-Neto and Cordeiro, 1996; Chesher and Smith, 1997). On the other hand, if the plot is non-linear, such a correction will not be adequate. Thus QQ plots can fruitfully be employed to see whether Bartlett-type corrections are worth investigating.

However, QQ plots also have disadvantages. One serious problem is that a QQ plot has no natural scale for the axes. Thus, if the hypothesized distribution changes, so will that scale. As we shall see in Section 4, this makes it difficult to plot on the same axes test statistics that have different distributions under the null. One way to get around this (see, for example, Mammen, 1992) is to employ a QQ plot of P values. This will convey exactly the same information as a P value plot, except that the plot will be reflected about the 45° line. In our view, however, QQ plots of P values are harder to read than P value plots, because it is more natural to have nominal test size on the horizontal axis.

Of course, graphical methods by themselves are not always enough. When test performance depends on a number of factors, the two-dimensional nature of most graphs and all tables can be limiting. In such cases, it may be desirable to supplement the graphs with estimated response surfaces which relate size or power to sample size, parameter values and so on; see, for example, Hendry (1984).

3 Alternative Forms of the Information Matrix Test

In order to illustrate and motivate P value and P value discrepancy plots, we use them to present the results of a study of the properties of alternative forms of the IM test. We compare tests based on the OPG regression, which was proposed as a way to compute IM tests by Chesher (1983) and Lancaster (1984), with two other forms of the IM test, including one proposed by us in Davidson and MacKinnon (1992a). Table 1 of that paper, which occupies a page and a half, is a particularly striking example of the disadvantages of presenting simulation results for test statistics in a non-graphical way. Even very recent papers about the IM test often fail to use graphical methods; one paper that, in our view, would benefit greatly from doing so is Cribari-Neto (1997).

We deal with two models, the univariate linear regression model with normal errors, and the probit model. In the former case, IM tests are pivotal, which means that their distributions do not depend on any nuisance parameters, and bootstrapping will therefore work perfectly. In the latter case, IM tests are not pivotal, so that bootstrapping will not work perfectly. However, because they are asymptotically pivotal, bootstrapping should yield inferences that are accurate to higher order, in the sample size, than the inferences provided by asymptotic theory (see Beran, 1988; Horowitz, 1994; Davidson and MacKinnon, 1996a).

Three of the variants of the IM test that we shall deal with pertain to the normal linear regression model

y_t = \delta_1 + \sum_{i=2}^{k} \delta_i X_{ti} + u_t, \qquad u_t \sim \mathrm{NID}(0, \sigma^2), \qquad t = 1, \ldots, n \qquad (4)

The regressors X_ti are normal random variables, independent across observations, and equicorrelated with correlation coefficient 1/2. Because all versions of the IM test for this model are pivotal, the values of σ and the δ_i are chosen arbitrarily.
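One standard way to draw such equicorrelated regressors is to mix a common factor with idiosyncratic noise: with weights √ρ and √(1 − ρ), each variate has unit variance and pairwise correlation exactly ρ. A self-contained sketch (function name and seed are ours):

```python
import random

def equicorrelated_normals(n, k, rho=0.5, seed=12345):
    """n observations on k standard normal regressors, independent across
    observations and equicorrelated with coefficient rho:
    X_ti = sqrt(rho)*common_t + sqrt(1 - rho)*noise_ti, so that
    Var(X_ti) = 1 and Corr(X_ti, X_tj) = rho for i != j."""
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    rows = []
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)
        rows.append([a * common + b * rng.gauss(0.0, 1.0) for _ in range(k)])
    return rows
```

With, say, 20,000 observations the sample correlation between any two columns should be very close to 1/2.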

The OPG variant of the IM test statistic is obtained by regressing an n-vector of 1s on ũ_t X_ti, for i = 1, ..., k, and on ½(k² + 3k) test regressors. These test regressors are functions of ẽ_t = ũ_t/σ̃ and the X_ti. Here ũ_t is the tth ordinary least squares (OLS) residual and σ̃² = (1/n) ∑_{t=1}^n ũ_t². There are k(k + 1)/2 − 1 test regressors of the form (ẽ_t² − 1) X_ti X_tj, which test for heteroscedasticity, k regressors of the form (ẽ_t³ − 3ẽ_t) X_ti, which test for skewness interacting with the regressors, and one regressor of the form ẽ_t⁴ − 5ẽ_t² + 2, which tests for kurtosis. The test statistic is n minus the sum of squared residuals from the regression, and it is asymptotically distributed as χ²(½(k² + 3k)).

The double-length regression (DLR) form of the IM test is a bit more complicated. It involves a double-length artificial regression with 2n 'observations', and the test statistic is 2n minus the sum of squared residuals from this regression. The number of regressors is the same as for the OPG test, and so is the asymptotic distribution of the test statistic. See Davidson and MacKinnon (1984, 1992a).

A third form of the IM test, which is less widely available than the other two, is the efficient score (ES) form. The ES form of the Lagrange multiplier test is often considered to have optimal or nearly optimal properties, because the only random quantities in the estimate of the IM are the restricted maximum likelihood (ML) parameter estimates. In this case, the ES form of the IM test is actually the sum of three test statistics:

\frac{1}{2}\, h_2' Z (Z'Z)^{-1} Z' h_2 + \frac{1}{6}\, h_3' X (X'X)^{-1} X' h_3 + \frac{1}{24n} \left( \sum_{t=1}^{n} (\tilde e_t^4 - 3) \right)^{\!2} \qquad (5)

where X has typical element X_ti, Z has typical element X_ti X_tj, h_2 has typical element ẽ_t² − 1, and h_3 has typical element ẽ_t³. These three statistics test for heteroscedasticity, skewness and kurtosis, respectively (see Hall, 1987).
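Each quadratic form in (5) is an explained sum of squares from projecting h_2 on Z or h_3 on X, so the statistic can be computed with ordinary least-squares machinery. A hedged NumPy sketch (names are ours, not the paper's):

```python
import numpy as np

def es_im_statistic(e, X):
    """Efficient score statistic (5) for the normal regression model.
    e is the vector of standardized ML residuals u_tilde/sigma_tilde,
    X is the n x k regressor matrix, and Z collects the squares and
    cross-products X_ti * X_tj."""
    n, k = X.shape
    Z = np.column_stack([X[:, i] * X[:, j]
                         for i in range(k) for j in range(i, k)])

    def ess(h, M):
        # explained sum of squares h'M(M'M)^{-1}M'h from regressing h on M
        fit = M @ np.linalg.lstsq(M, h, rcond=None)[0]
        return fit @ fit

    h2 = e**2 - 1                                  # heteroscedasticity
    h3 = e**3                                      # skewness
    kurtosis = np.sum(e**4 - 3) ** 2 / (24.0 * n)  # kurtosis
    return 0.5 * ess(h2, Z) + ess(h3, X) / 6.0 + kurtosis
```

All three components are nonnegative, so the statistic is as well.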

The other two variants of the IM test that we shall deal with pertain to the probit model. Consider any binary response model of the form

E(y_t \mid \Omega_t) = D(X_t \beta) \qquad t = 1, \ldots, n \qquad (6)

where y_t is a 0–1 dependent variable, Ω_t denotes an information set, D(·) is a twice continuously differentiable, increasing function that maps from the real line to the 0–1 interval, X_t is a 1 × k row vector of observations on variables that belong to Ω_t, and β is a k-vector of unknown parameters. The contribution to the log-likelihood by observation t is

\ell_t = y_t \log D(X_t \beta) + (1 - y_t) \log\bigl(1 - D(X_t \beta)\bigr) \qquad (7)

and the derivative with respect to β_i is

\frac{\partial \ell_t}{\partial \beta_i} = \frac{X_{ti}\, d_t (y_t - D_t)}{D_t (1 - D_t)} \qquad (8)

where d_t = d(X_t β) is the derivative of D_t = D(X_t β). After some algebra, the derivative of (8) with respect to β_j can be seen to be

\frac{X_{ti} X_{tj}}{D_t^2 (1 - D_t)^2} \Bigl[ (y_t - D_t)\, d_t'\, D_t (1 - D_t) - d_t^2 (y_t - D_t)^2 \Bigr] \qquad (9)

where d'_t = d'(X_t β) is the derivative of d_t. Thus the (i, j) direction of the IM test is given by

\frac{\partial^2 \ell_t}{\partial \beta_i\, \partial \beta_j} + \frac{\partial \ell_t}{\partial \beta_i}\, \frac{\partial \ell_t}{\partial \beta_j} = \frac{X_{ti} X_{tj}\, d_t'}{D_t (1 - D_t)}\, (y_t - D_t) \qquad (10)

The OPG regression simply involves regressing a vector of 1s on k regressors with typical element (8) and on k(k + 1)/2 − 1 test regressors with typical element (10), where all regressors are evaluated at the ML estimates β̃. The test statistic is n minus the sum of squared residuals from this regression. The number of test regressors is normally one less than k(k + 1)/2 because, if there is a constant term, the test regressor for which i and j both index the constant will be perfectly collinear with the regressor for the constant from (8).

It is also quite easy to derive an ES variant of the IM test for binary response models (see Orme, 1988). Consider the artificial regression

\tilde w_t (y_t - \tilde D_t) = \tilde w_t \tilde d_t X_t b + \tilde w_t \tilde d_t (\tilde d_t'/\tilde d_t) W_t c + \text{residual} \qquad (11)

where D̃_t = D(X_t β̃), d̃_t = d(X_t β̃), d̃'_t = d'(X_t β̃), w̃_t = (D̃_t(1 − D̃_t))^{−1/2} and W_t is a vector with typical element X_ti X_tj. By standard results for artificial regressions (see Davidson and MacKinnon, 1993, ch. 15), the explained sum of squares from regression (11) is a test statistic for c = 0, and from (10) it is clear that this test statistic is testing in the directions of the IM test. That this test statistic is the ES variant is evident from the fact that none of the regressors in (11) depends directly on y_t; their only dependence on the data is through the ML estimates β̃.

In our experiments, we deal with a probit model that has between two and five regressors, one of them a constant term and the rest independent N(0, 1) random variates. For this model, D(X_t β) is the cumulative standard normal distribution function Φ(X_t β), d(X_t β) is the standard normal density φ(X_t β), and d'_t/d_t = −X_t β.
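The last identity follows from φ'(x) = −x φ(x) for the standard normal density. A quick numerical check (our own code, not the paper's):

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal c.d.f., via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# d'(x)/d(x) = -x for the probit model, because phi'(x) = -x * phi(x);
# compare the ratio with a central-difference derivative of phi
h = 1e-6
for x in (-1.5, 0.3, 2.0):
    deriv = (phi(x + h) - phi(x - h)) / (2.0 * h)
    assert abs(deriv / phi(x) + x) < 1e-4
```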

In the case of the probit model, (11) simplifies slightly to

\tilde w_t (y_t - \tilde\Phi_t) = \tilde w_t \tilde\phi_t X_t b + \tilde w_t \tilde\phi_t (-X_t \tilde\beta) W_t c + \text{residual} \qquad (12)

This artificial regression is precisely the one for testing the hypothesis that c = 0 in the model

E(y_t \mid \Omega_t) = \Phi\!\left( \frac{X_t \beta}{\exp(W_t c)} \right) \qquad (13)

This model involves a form of heteroscedasticity, and it reduces to the ordinary probit model when c = 0. Thus we see that, for the probit model, the IM test is equivalent to a test for a certain type of heteroscedasticity.

It is well known that many forms of the IM statistic, most notoriously the OPG form, have very poor finite-sample properties under the null. See, among others, Taylor (1987), Chesher and Spady (1991), Davidson and MacKinnon (1992a) and Horowitz (1994). One way to improve these finite-sample properties is to use the parametric bootstrap. This idea was investigated by Horowitz (1994), who found that it worked very well for IM tests in probit and tobit models. Horowitz used the bootstrap to compute critical values. An alternative approach, which we prefer, is to use it to compute P values. After a test statistic, say t̂, has been obtained, B sets of simulated data are generated from a DGP that satisfies the null hypothesis, using the parameter estimates from the actual sample, and B bootstrap test statistics are computed. Suppose that B* of these test statistics are greater than t̂, or greater in absolute value if it is a two-tailed test. Then the estimated P value for t̂ is B*/B.
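The bootstrap P value B*/B described above is a one-line computation once the B bootstrap statistics are in hand; the function name is ours:

```python
def bootstrap_pvalue(t_hat, boot_stats, two_tailed=False):
    """Estimated bootstrap P value B*/B: the fraction of the B bootstrap
    statistics exceeding the observed statistic t_hat (in absolute value
    for a two-tailed test)."""
    if two_tailed:
        b_star = sum(abs(t) > abs(t_hat) for t in boot_stats)
    else:
        b_star = sum(t > t_hat for t in boot_stats)
    return b_star / len(boot_stats)

# two of the four bootstrap statistics exceed the observed value of 2.0
print(bootstrap_pvalue(2.0, [0.5, 1.0, 2.5, 3.0]))   # 0.5
```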

In the case of the linear regression model (4), all forms of the IM test statistic are pivotal. This implies that, as B → ∞, the EDF of the bootstrapped P values must tend to the 45° line, and that the size–power curve will be unaffected by bootstrapping (see Beran, 1988). Therefore, in this case, we do not need to do a Monte Carlo experiment to see how the parametric bootstrap works; except for experimental error, it will work perfectly. The error terms for the bootstrap samples must be obtained from a normal distribution rather than by resampling from the residuals, since the residuals will not be normally distributed. Thus the result that the bootstrap will work perfectly for IM tests in linear regression models holds only for the parametric bootstrap.

Since IM tests for the probit model are only asymptotically pivotal, it is of interest to see how bootstrapping affects their size and power. The results, which are presented in Sections 4, 5 and 6, turn out to be interesting.

4 IM Tests under the Null

Figure 1 shows P value plots for the OPG, DLR and ES variants of the IM test for the regression case with n = 100 and k = 2. These are based on an experiment with 100,000 replications. The x_i were chosen as in (3), so that m = 215. The figure makes it dramatically clear that the OPG form of the IM test works very badly under the null. In this case, it rejects just under half the time at the nominal 5 per cent level. In contrast, the DLR form seems to work quite well, and the ES form over-rejects in the left tail and under-rejects elsewhere.

Figure 1 illustrates some of the advantages and disadvantages of P value plots. On the one hand, they make it very easy to distinguish tests that work well, such as the DLR form, from tests that work badly, such as the OPG form. Moreover, because they show how a test performs for all nominal sizes, P value plots are particularly useful for tests that both over- and under-reject, such as the ES form. On the other hand, P value plots do not make it easy to see patterns in the behaviour of tests that work well. For example, one has to look quite closely at the figure to see that DLR systematically under-rejects for small test sizes.

Another potential disadvantage of P value plots is that they can take up a lot of space. Since, in most cases, we are primarily interested in reasonably small test sizes, it may make sense to truncate the plot at some value of x less than unity. Figure 2 shows two sets of P value plots, both truncated at x = 0.25. These plots provide a great deal of information about how the sample size n and the number of regressors k affect the performance of the OPG form of the IM test. From Fig. 2(a), we see that its performance improves as n increases but is still unsatisfactory for n = 1600. From Fig. 2(b), we see that its performance deteriorates dramatically as k (and hence the number of degrees of freedom) increases. These figures tell us all we really need to know about how the OPG IM test behaves under the null.

[Fig. 1 P Value Plots for IM Tests, Regression Model, n = 100, k = 2. Horizontal axis: nominal size; vertical axis: actual size.]

[Fig. 2 P Value Plots for OPG IM Tests, Regression Model: (a) k = 2; (b) n = 200. Horizontal axes: nominal size; vertical axes: actual size.]

[Fig. 3 P Value Discrepancy Plots for DLR IM Tests, Regression Model: (a) k = 2; (b) n = 200. Horizontal axes: nominal size; vertical axes: size discrepancy.]

For the DLR variant of the IM test, P value plots are not very informative because the tests perform so well. Figure 3 therefore provides P value discrepancy plots for this test for the same cases as Fig. 2, but truncated at x = 0.40. It is natural to ask whether P value discrepancies, such as those in Fig. 3, can be explained by experimental randomness. The Kolmogorov–Smirnov (KS) test is often used to test whether an EDF is compatible with some specified distribution function. For EDFs of P values, the KS test statistic is simply

\max_{j = 1, \ldots, N} \bigl| \hat F(p_j) - p_j \bigr| \qquad (14)

This is almost equal to the largest absolute value of the P value discrepancy plot

\max_{i = 1, \ldots, m} \bigl| \hat F(x_i) - x_i \bigr| \qquad (15)

[Fig. 4 P Value Discrepancy Plots for ES IM Tests, Regression Model: (a) k = 2; (b) n = 200. Horizontal axes: nominal size; vertical axes: size discrepancy.]

Note, however, that expression (14) will almost always be slightly bigger than expression (15), because the maximum is being taken over a larger number of points. If we ignore this very slight difference, there is an extremely easy way to see if observed P value discrepancies could be the result of experimental randomness. We simply need to draw horizontal lines representing the critical values for the KS test on our figure. The lines labelled KS_{0.05} in Fig. 3 represent the 0.05 critical values.
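Expression (15) and the KS critical line are both easy to compute. The constant 1.358 below is the standard large-sample 5 per cent KS critical value (the critical line sits at roughly ±1.358/√N); the function names are our own sketch:

```python
from bisect import bisect_right

def max_discrepancy(p_values, grid):
    """Expression (15): the largest absolute P value discrepancy
    |F_hat(x_i) - x_i| over the m grid points."""
    p_sorted = sorted(p_values)
    n = len(p_sorted)
    return max(abs(bisect_right(p_sorted, x) / n - x) for x in grid)

def ks_critical_5pct(n_replications):
    """Large-sample 5 per cent Kolmogorov-Smirnov critical value,
    approximately 1.358 / sqrt(N)."""
    return 1.358 / n_replications ** 0.5

# toy example: four P values, discrepancy evaluated at three grid points
print(max_discrepancy([0.2, 0.6, 0.7, 0.9], [0.25, 0.5, 0.75]))   # 0.25
```

With N = 100,000 replications, as in the experiments reported here, the 5 per cent critical line sits at about 0.0043, so discrepancies larger than that are hard to attribute to experimental randomness.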

Figure 3(a) makes it clear that the DLR form of the IM test systematically under-rejects for small test sizes when k = 2. However, as we see from Fig. 3(b), things change as k increases. For n = 200, the DLR form over-rejects for all test sizes whenever k ≥ 5. For large values of k and small values of n, the DLR form is sufficiently badly behaved that a P value plot might be more appropriate than a P value discrepancy plot.

Figure 4 shows P value discrepancy plots for the ES form of the IM test. This figure is comparable to Fig. 3, but the vertical scale is somewhat different. It is very clear that the ES form over-rejects when the nominal test size is small but under-rejects when the nominal size is large enough. These over- and under-rejections become more pronounced as n falls and k rises, although the effects of reducing n and increasing k are by no means the same. The nominal test size at which the test switches from over-rejection to under-rejection becomes larger as k increases.

[Fig. 5 P Value Plots for OPG IM Tests, Probit Model: (a) k = 3; (b) n = 200. Horizontal axes: nominal size; vertical axes: actual size.]

We now turn to the results for the probit model. Figure 5 shows that the OPG variant of the IM test rejects far too often under the null. This over-rejection becomes worse as the number of regressors k increases and as the sample size n declines. It also depends on the parameters of the DGP. The figure reports results from two sets of experiments, the first with β₀ = 0 and the other β_i equal to 1, and the second with all the β_i equal to 1. The performance of the OPG test is always somewhat worse in the latter case. From the figure, we see that the OPG test should never be used without bootstrapping and that, because the test is definitely non-pivotal, bootstrapping cannot be expected to work perfectly.

Figure 6 shows P value discrepancy plots for the ES form of the IMtest for probit models. P value plots would also be reasonably informativein this case, but they would be hard to decipher in the left-hand tail, whichis the region of greatest interest. To keep the ¢gure readable, only resultsfor the b0 � 0 case are shown. The comparable ¢gure for b0 � 1, which isnot shown, looks similar but not identical to this one. Like Fig. 5, this¢gure is based on 100,000 replications. The striking feature of Fig. 6 is thatthe ES test rejects too often for small test sizes and not nearly oftenenough for larger sizes. When k � 3, tests at the 0.05 level perform quite

Fig. 6 P Value Discrepancy Plots for ES IM Tests, Probit Model: (a) k = 3; (b) n = 200
[Panels (a) and (b) plot size discrepancy against nominal size.]


well, especially for n = 50 and n = 100. But the test over-rejects quite severely at the 0.01 and 0.001 levels, and its performance at the 0.05 level actually deteriorates as n increases. Thus, reporting results only for the 0.05 level could be very misleading.

By this point, it should be clear why P value plots and P value discrepancy plots are often much more informative than tables of rejection frequencies at conventional significance levels like 0.05. First, there is nothing special about the 0.05 level. Some investigators may prefer to use the 0.01 level or even the 0.001 level, while pre-tests are often performed at levels of 0.25 or even higher. Moreover, if investigators are simply looking at P values rather than testing at some specified level, they will generally place much more weight on a P value of 0.001 than on a P value of 0.049. This can be a dangerous thing to do.
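All the plots discussed here are built from the EDF of the simulated P values. As an illustrative sketch (the code and names below are ours, not part of the original experiments), the points of a P value plot and of the corresponding discrepancy plot can be computed as follows:

```python
import numpy as np

def pvalue_plot_points(pvals, grid):
    """EDF of simulated P values evaluated at a grid of nominal sizes.
    Plotting Fhat against the grid gives a P value plot; plotting
    Fhat - grid against the grid gives a P value discrepancy plot."""
    pvals = np.sort(np.asarray(pvals))
    Fhat = np.searchsorted(pvals, grid, side="right") / len(pvals)
    return Fhat, Fhat - grid

# toy example: a test whose P values are exactly uniform under the null
rng = np.random.default_rng(42)
pvals = rng.uniform(size=100_000)
grid = np.arange(0.002, 1.0, 0.002)
Fhat, disc = pvalue_plot_points(pvals, grid)
```

For a well-behaved test, the P value plot should hug the 45-degree line and the discrepancy plot should hover around zero, up to experimental randomness.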

Another reason for being concerned with the entire distribution of a

Fig. 7 QQ Plots for ES IM Tests, Regression Model, n = 200
[Observed quantiles plotted against theoretical quantiles.]


test statistic, or at least the left-hand part of it, is that, in our experience, a test statistic that is seriously unreliable at any level for some combination of parameter values and sample size is likely to be seriously unreliable at the 0.05 level for some other combination of parameter values and sample size; consider Figs 4 and 6. Thus serious P value discrepancies at any level should be a cause for concern.

As we discussed above, an alternative graphical technique, which has long been used in the statistics literature, is the QQ plot. Figure 7 shows QQ plots for the same experiments as Fig. 4(b), i.e. for the regression model ES IM test with 200 observations and several values of k, the number of regressors. Since these experiments had 100,000 replications, the plots are based only on the 215 quantiles given in (3). It is instructive to compare Fig. 7 with Fig. 4(b). Although the QQ plots in Fig. 7 certainly make it clear that the ES IM test does not perform perfectly, this figure provides much less useful information, and takes up much more space, than does Fig. 4(b). It is extremely difficult, on the basis of Fig. 7, to see how the performance of the ES IM test changes with k, something that is immediately obvious from the P value discrepancy plots of Fig. 4(b).

Of course, because they use one dimension for nominal test size, P value plots and P value discrepancy plots cannot use that dimension to represent something else, such as the value of some parameter or the sample size. In consequence, there are bound to be cases in which other types of plots are equally or more informative. There may well be occasions when, like residual plots for regressions, P value plots provide valuable information to investigators but do not convey it succinctly enough to merit publication.

4 Smoothing P Value Discrepancy Plots

Even though Fig. 3 is based on 100,000 replications, it is somewhat jagged as a consequence of experimental randomness. When P value discrepancy plots are based on more modest numbers of replications, as will generally be the case when test statistics are expensive to compute, perhaps because they involve bootstrapping, these plots can be quite jagged. Therefore, it is natural to think about how to obtain smoother plots that may be easier to interpret.

One way to smooth a P value discrepancy plot is to regress the discrepancies on smooth functions of x_i, such as polynomials or trigonometric functions. If we let v_i denote F̂(x_i) − x_i and f_l(x_i) denote the l-th function of x_i, the first of which may be a constant term, such a regression can be written as

    v_i = Σ_{l=1}^{L} γ_l f_l(x_i) + u_i,    i = 1, ..., m    (16)


One difficulty is that the u_i are neither homoscedastic nor serially uncorrelated. However, it turns out to be relatively easy to derive a feasible generalized least squares (GLS) procedure, and this is done in the Appendix.

It is not clear how to choose L and the functions f_l(x_i) that appear in (16). One obvious approach is to use powers of x_i as regressors. Another is to use the functions sin(lπx_i) for l = 1, 2, 3, ..., and no constant term. The advantage of the latter approach is that sin(0) = sin(lπ) = 0, so that the approximation, like the discrepancy function itself, is constrained to equal zero at x_i = 0 and x_i = 1. However, this may not always be an advantage. If a test over-rejects severely, F(x_i) − x_i may be large even for x_i near zero, and it may be hard for a function that equals zero at x_i = 0 to fit well with a reasonable number of terms. For a given set of regressors, the choice of L can be made in various ways. We have chosen it to maximize the Akaike information criterion (AIC), i.e. the value of the log-likelihood function minus L (see the Appendix).
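The smoothing regression (16) with a sine basis and the log-likelihood-minus-L criterion can be sketched as follows. This illustration uses plain OLS rather than the feasible GLS procedure derived in the Appendix, and all names are ours:

```python
import numpy as np

def smooth_discrepancies(x, v, L_max=8):
    """Regress discrepancies v_i = Fhat(x_i) - x_i on sin(l*pi*x_i),
    l = 1, ..., L, choosing L to maximize log-likelihood minus L,
    the criterion described in the text.  Plain OLS is used here."""
    best = None
    n = len(x)
    for L in range(1, L_max + 1):
        Z = np.column_stack([np.sin(l * np.pi * x) for l in range(1, L + 1)])
        gamma = np.linalg.lstsq(Z, v, rcond=None)[0]
        resid = v - Z @ gamma
        sigma2 = resid @ resid / n
        loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
        if best is None or loglik - L > best[0]:
            best = (loglik - L, L, gamma)
    return best[1], best[2]

# toy example: a smooth discrepancy function observed with noise
rng = np.random.default_rng(0)
x = np.arange(0.005, 1.0, 0.005)
v = 0.02 * np.sin(np.pi * x) + rng.normal(0.0, 0.002, size=x.size)
L, gamma = smooth_discrepancies(x, v)
```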

Figure 8 illustrates the smoothing procedure we have just described. It shows P value discrepancy functions for the bootstrapped OPG IM test for the regression model with n = 100 and k = 2, based on 199 bootstrap samples. Because IM tests in linear regression models are pivotal, this test should work perfectly, except for experimental error. The AIC chose the

Fig. 8 P Value Discrepancy Plots for the Bootstrap OPG IM Test, Regression Model, n = 100, k = 2
[Size discrepancy plotted against nominal size.]


trigonometric model with two regressors, sin(πx_i) and sin(2πx_i), and the resulting smoothed P value discrepancy function is shown in the figure. The figure also shows confidence bands, which were obtained by adding to and subtracting from the F̃_i twice the square roots of the diagonal elements of the covariance matrix of the fitted values. These confidence bands are very narrow near 0 and 1, because of the constraint that the trigonometric function must always pass through those points, and they always include the true value of zero.

When using P value discrepancy plots to study the performance of bootstrap tests, it is extremely important to choose B, the number of bootstrap samples, with care. B should be chosen so that x(B + 1) is an integer for every x at which the discrepancy will be calculated. See Davidson and MacKinnon (1996b) for an intuitive explanation and references. Since all the curves shown in Fig. 8 are plotted at the points 0.005, 0.010, 0.015, ..., 0.995, the smallest number of bootstrap samples that it is appropriate to use is 199.
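The requirement that x(B + 1) be an integer is easy to check mechanically. A minimal sketch (our own code) for the grid used in Fig. 8:

```python
def valid_bootstrap_B(grid, B):
    """Check that x * (B + 1) is an integer for every nominal size x
    at which a discrepancy will be calculated."""
    return all(abs(x * (B + 1) - round(x * (B + 1))) < 1e-9 for x in grid)

# the plotting grid used for Fig. 8: 0.005, 0.010, ..., 0.995
grid = [0.005 * i for i in range(1, 200)]

valid_bootstrap_B(grid, 199)   # True: x * 200 is an integer for every x
valid_bootstrap_B(grid, 200)   # False: e.g. 0.005 * 201 is not an integer
```

For this grid, B + 1 must be a multiple of 200, so B = 199 is indeed the smallest acceptable value.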

Using the wrong number of bootstrap samples can produce quite unsatisfactory results. This is illustrated in Fig. 9, which is based on the same set of 10,000 replications as Fig. 8. The P value discrepancy plots based on B = 200 and B = 100 are shown along with the one based on

Fig. 9 P Value Discrepancy Plots for the Bootstrap OPG IM Test, Regression Model, n = 100, k = 2
[Size discrepancy plotted against nominal size.]


B = 199. Using just one too many bootstrap samples substantially changes the results: the discrepancies are much more positive than they should be in the left-hand part of the figure. If the inequality in the definition (1) of the EDF had been made strict, the discrepancies would instead have been too negative in the right-hand part of the figure. In either case, the shape of the discrepancy plot will be wrong. Using B = 100, as some authors have done in studies of bootstrap testing, produces even worse results. Note that for B = 100 the figure uses only the points 0.01, 0.02, ..., 0.99. If the full set of points were used, the plot would exhibit an extreme sawtooth pattern.

Smoothing provides a way to test the null hypothesis that the observed P value discrepancies are entirely due to experimental error. A test that should usually be much more powerful than a KS test is a likelihood ratio (LR) test for γ = 0. Such a test may easily be computed. As an example, for the bootstrapped OPG test with B = 200, both the correctly computed KS test, based on (14), and the approximate one, based on (15), are equal to 0.0117. These tests are not significant at the 0.05 level (P = 0.129). In contrast, the LR test for the smoothing regression with polynomial regressors is 34.00 with 2 degrees of freedom, which is significant at any imaginable level. Note that the LR test statistics for the bootstrapped OPG test with B = 199 have P values of 0.256 for

Fig. 10 Smoothed P Value Discrepancy Plots, Bootstrap Tests, Probit, k = 3: (a) ES Form; (b) OPG Form
[Panels (a) and (b) plot size discrepancy against nominal size.]


the trigonometric specification and 0.785 for the polynomial one. Thus the evidence suggests that the bootstrapped OPG test performs exactly the way theory says it should, provided that B is chosen correctly.

We now turn our attention to the bootstrapped IM tests for the probit model. These results are based on 10,000 replications with 399 bootstrap samples per replication. Figure 10 shows smoothed P value discrepancy plots for both forms of the IM test, for k = 3, several sample sizes, and the two sets of parameter values used in Fig. 5. Several things are evident from this figure. Bootstrapping works remarkably well, but it generally does not work perfectly. The discrepancies between nominal and actual size are reasonably small, much smaller than the discrepancies associated with using asymptotic P values. However, we can reject the hypothesis that these discrepancies are zero in every case but one (the ES form for n = 100 and β0 = 0). These discrepancies can be of either sign, and they are usually smaller for the larger sample sizes, as we would expect. Parameter values clearly matter a lot. Because of this, it is unreasonable to expect that bootstrapped IM tests for the probit model will always work as well as they seem to in this example.

5 Size–Power Curves

It is often desirable to compare the power of alternative test statistics, but this can be difficult to do if all the tests do not have the correct size. Suppose we perform a Monte Carlo experiment in which the data are generated by a process belonging to the alternative hypothesis. The test statistics of interest are calculated for each replication, and corresponding P values are obtained. If the EDFs of these P values are plotted, the result will not be very useful, since we will be plotting power against nominal test size. Unfortunately, this is what is often implicitly done when test power is reported in a table.

In order to plot power against true size, we need to perform two experiments, preferably using the same sequence of random numbers. In the first experiment, the DGP satisfies the null hypothesis, and in the second it does not. Let the points on the two approximate EDFs be denoted F̂(x) and F̂*(x), respectively. These are to be evaluated at a pre-chosen set of points x_i, i = 1, ..., m. As before, F(x) is the probability of getting a nominal P value less than x under the null. Similarly, F*(x) is the probability of getting a nominal P value less than x under the alternative. Tracing the locus of points (F(x), F*(x)) inside the unit square as x varies from 0 to 1 thus generates a size–power curve on a correct size-adjusted basis. Plotting the points (F̂(x_i), F̂*(x_i)), including the points (0, 0) and (1, 1), does exactly the same thing, except for experimental error. By using the same set of random numbers in both experiments, we can reduce experimental error, since the correlation between F̂(x_i) − F(x_i) and


F̂*(x_i) − F*(x_i) will normally be quite high. The idea of plotting power against true size to obtain a size–power curve is not at all new. When we wrote the first version of this paper, we believed that this very simple method of doing so was new, but in fact it was used by Wilk and Gnanadesikan (1968).
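The construction just described reduces to evaluating two EDFs on a common grid and pairing them. A sketch (our own code; the Beta alternative below is a toy stand-in for a test with power, not a DGP from the paper):

```python
import numpy as np

def size_power_curve(p_null, p_alt, grid):
    """Pair the EDFs of P values simulated under the null and under the
    alternative, evaluated at a common grid, giving the points
    (Fhat(x_i), Fhat*(x_i)) of a size-power curve."""
    p_null, p_alt = np.sort(p_null), np.sort(p_alt)
    size = np.searchsorted(p_null, grid, side="right") / len(p_null)
    power = np.searchsorted(p_alt, grid, side="right") / len(p_alt)
    # include the corner points (0, 0) and (1, 1), as suggested in the text
    return (np.concatenate([[0.0], size, [1.0]]),
            np.concatenate([[0.0], power, [1.0]]))

# toy example: uniform P values under the null; a Beta(0.5, 1) alternative
# stands in for a test with power (its P values pile up near zero)
rng = np.random.default_rng(1)
p0 = rng.uniform(size=100_000)
p1 = rng.beta(0.5, 1.0, size=100_000)
grid = np.arange(0.005, 1.0, 0.005)
size, power = size_power_curve(p0, p1, grid)
```

Using the same random numbers for both experiments, as recommended in the text, would further reduce the experimental error in the plotted curve.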

Unless the null is a simple one, there will be an infinite number of DGPs that satisfy it. When the test statistic is pivotal, it does not matter which one we use to generate F̂(x). However, when it is not pivotal, the choice of which DGP to use can matter greatly. In Davidson and MacKinnon (1996b), we argue that a reasonable choice is the pseudo-true null, which is the DGP that satisfies the null hypothesis and is as close as possible, in the sense of the Kullback–Leibler information criterion, to the DGP used to generate F̂*(x); see also Horowitz (1994, 1995). In the experiments for the probit model, to be discussed below, we used the pseudo-true null.

Figure 11 shows size–power curves for the OPG, DLR and ES forms

Fig. 11 Size–Power Curves, Regression Model, Kurtosis, k = 2
[Power plotted against size.]


of the IM test for regression models with k = 2 for n = 100 and n = 200. The error terms in the non-null DGP were generated as a mixture of normals with different variances, and they therefore displayed kurtosis. Several results are immediate from this figure. The ES form has the greatest power for a given size of test, followed by the DLR form. The OPG form has far less power than the other two, and for n = 100 it actually has power less than its size for true sizes that are small enough to be interesting. Numerous other experiments, for which results are not shown, produced the same ordering of the three tests. The inferiority of the OPG form was always very marked.

These experimental results make it clear that, if one is going to use an IM test for a linear regression model, the best one to use is the bootstrapped ES form. It must have essentially the correct size (because the test is pivotal), and it will have better power than any of the other tests. This conclusion contradicts that of Davidson and MacKinnon (1992a), who failed to allow for the possibility of bootstrapping and ignored the issue of power. Of course, it might well be better yet to test for heteroscedasticity, skewness and kurtosis separately. The component pieces of the ES form (5), with P values determined by bootstrapping, could be used for this purpose.

There is one practical problem with drawing size–power curves by plotting F̂*(x_i) against F̂(x_i). For tests that under- or over-reject severely under the null, there may be a region of the size–power curve that is left out by a choice of values of x_i such as those given in (2) or (3). For instance, suppose that a test over-rejects severely for small sizes, as the OPG IM test does. Then, even if x_i is very small, there may be many replications under the null for which the realized P value is still smaller. As an example, for the OPG test in the regression model with n = 100 and k = 5, F̂(0.001) = 0.562 and F̂*(0.001) = 0.397. Therefore, if the size–power curve were plotted with x_1 = 0.001, there would be a long straight segment extending from (0, 0) to (0.562, 0.397). Such a straight segment would bear clear witness to the gross over-rejection of which the OPG IM test is guilty, but it would also bear witness to a severe lack of detail in the depiction of how the test behaves.

It could well be argued that tests which behave very badly under the null are not of much interest, so that this is not a serious problem. In any case, the problem is not difficult to solve. We simply have to make sure that the x_i include enough very small numbers. Experience suggests that adding the following 18 points to (3) will produce reasonably good results even in extreme cases:

    x_i = 0.1×10⁻⁸, 0.2×10⁻⁸, 0.5×10⁻⁸, ..., 0.1×10⁻³, 0.2×10⁻³, 0.5×10⁻³

Of course, this assumes that small P values are computed accurately, which may not always be the case.
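A minimal sketch (our own code) that generates these 18 extra points:

```python
# the 18 points suggested in the text: 0.1, 0.2 and 0.5 times 10^-e
# for e = 8, 7, ..., 3, in increasing order
extra = [m * 10.0 ** -e for e in range(8, 2, -1) for m in (0.1, 0.2, 0.5)]
```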


In the remainder of this section, we discuss size–power curves for IM tests in probit models. These turn out to be quite interesting. The data are generated by the model

    E(y_t | Ω_t) = Φ((β0 + X_t1 + X_t2) / exp(0.5X_t1 + 0.5X_t2))    (17)

where β0 = 0 or β0 = 1. This model is a special case of (13), and it is similar to one of the alternatives used by Horowitz (1994).

We performed six experiments, for n = 50, n = 100 and n = 200, for each of β0 = 0 and β0 = 1. We also ran a few additional experiments with k = 2 instead of k = 3 to verify that the number of regressors did not affect the principal conclusions. These experiments used 10,000 replications with B = 799; the value of B was relatively large in order to ensure that power loss from bootstrapping would be minimal. It is not possible to graph the results for all six experiments in one figure, but Fig. 12 does show the principal results. The ES test with asymptotic P values performs very much better than the asymptotic OPG test when β0 = 1. However, when β0 = 0, it performs a bit better when n = 100 and a bit less well when n = 200.

Fig. 12 Size–Power Curves for Probit IM Tests, k = 3: (a) β0 = 0; (b) β0 = 1
[Panels (a) and (b) plot power against size.]


Bootstrapping often has a substantial effect on the performance of the OPG test, but it generally has little effect on the performance of the ES test. The power of both tests is sometimes increased and sometimes reduced. Both these results are consistent with the theoretical results of Davidson and MacKinnon (1996b), since, as we saw in Fig. 10, the bootstrapped OPG test is much farther from being pivotal than the bootstrapped ES test. Because the asymptotic tests, especially the OPG form, are emphatically not pivotal, the size–power curves for those tests in Fig. 12 might have been very different if we had used a different null DGP to generate F̂(x). It is safer, but still not entirely safe, to compare the size–power curves for the bootstrap tests.

From these results and those discussed earlier, we conclude that the ES form of the IM test with bootstrap P values is the procedure of choice for probit models, just as it is for regression models. It never has substantially less power than the bootstrap OPG form, and it sometimes has substantially more. The bootstrap P values are much more accurate than the asymptotic ones, especially when they are very small: compare Figs 5 and 6 with Fig. 10. However, Fig. 10 makes it clear that bootstrapping will not produce entirely accurate test sizes for small samples.

6 Conclusion

Monte Carlo experiments are a valuable tool for obtaining information about the properties of specification testing procedures in finite samples. However, the rich detail in the results they provide can be difficult to comprehend if presented in the usual tabular form. In this paper, we have discussed several graphical techniques that can make the principal results of an experiment immediately obvious. All these techniques rely on the construction of an estimated c.d.f. (EDF) of the nominal P values associated with some test statistic. From these, we can easily obtain a variety of diagrams, namely P value plots, P value discrepancy plots (which may optionally be smoothed), and size–power curves.

We have illustrated these techniques by presenting the results of a number of experiments concerning alternative forms of the IM test. We believe that these results, which are entirely presented in graphical form, are of considerable interest and provide more information in a more easily assimilated fashion than a tabular presentation could possibly have done.

Appendix

In this appendix we discuss GLS estimation of equation (16). If the regression function in (16) were chosen correctly, the error term u_i would be equal to F̂(x_i) − F(x_i). It can be shown that, for any two points x and x′ in the (0, 1) interval,


    var(F̂(x)) = N⁻¹F(1 − F)

and

    cov(F̂(x), F̂(x′)) = N⁻¹[min(F, F′) − FF′]    (A1)

Here F ≡ F(x) and F′ ≡ F(x′). The first line of (A1) is just a restatement of the well-known result about the variance of the mean of N Bernoulli trials. Expression (A1) makes it clear that Ω, the m × m covariance matrix of the u_i in (16), exhibits a moderate amount of heteroscedasticity and a great deal of serial correlation. Both the standard deviation of u_i and the correlation between u_i and u_{i−1} are greatest when F_i = 0.5 and decline as F_i approaches 0 or 1.
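As an illustration (our own code), the covariance matrix implied by (A1) can be tabulated directly, and it displays the pattern just described:

```python
import numpy as np

def edf_covariance(F, N):
    """Covariance matrix Omega implied by (A1):
    cov(Fhat(x_i), Fhat(x_j)) = (min(F_i, F_j) - F_i F_j) / N."""
    F = np.asarray(F)
    return (np.minimum.outer(F, F) - np.outer(F, F)) / N

F = np.arange(0.005, 1.0, 0.005)
Omega = edf_covariance(F, N=100_000)
# the variance F(1 - F)/N is largest at F = 0.5 and shrinks toward 0 and 1
```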

Equation (16) can be rewritten using matrix notation as

    v = Zγ + u,    E(uu′) = Ω    (A2)

where Z is an m × L matrix, the columns of which are the regressors in (16). The GLS estimator of γ is

    γ̃ = (Z′Ω⁻¹Z)⁻¹Z′Ω⁻¹v    (A3)

The smoothed discrepancies are the fitted values Zγ̃ from (A2), and the covariance matrix of these fitted values is

    Z(Z′Ω⁻¹Z)⁻¹Z′    (A4)

Confidence bands can be constructed by using the square roots of the diagonal elements of (A4).

It is easily verified that the inverse of Ω has non-zero entries only on the principal diagonal and the two adjacent diagonals. Specifically,

    Ω⁻¹_{ii} = N(F_{i+1} − F_i)⁻¹ + N(F_i − F_{i−1})⁻¹
    Ω⁻¹_{i,i+1} = −N(F_{i+1} − F_i)⁻¹                    (A5)
    Ω⁻¹_{i,i−1} = −N(F_i − F_{i−1})⁻¹

for i = 1, ..., m, and Ω⁻¹_{i,j} = 0 for |i − j| > 1. In these formulae, F_0 = 0 and F_{m+1} = 1. In contrast to many familiar examples of GLS estimation, it is neither easy nor necessary to triangularize Ω⁻¹ in this case. Because Ω⁻¹ has non-zero elements only along three diagonals, it is not difficult to compute Z′Ω⁻¹Z and Z′Ω⁻¹v directly. For example, the (l, j)th element of Z′Ω⁻¹Z is

    Σ_{i=1}^{m} Z_{il} Z_{ij} Ω⁻¹_{ii} + Σ_{i=2}^{m} Z_{i−1,l} Z_{ij} Ω⁻¹_{i,i−1} + Σ_{i=1}^{m−1} Z_{i+1,l} Z_{ij} Ω⁻¹_{i,i+1}    (A6)

where the needed elements of Ω⁻¹ were defined in (A5). Thus, by using (A6), it is straightforward to compute the GLS estimates (A3).

True GLS estimation is not feasible here. If the test statistic being studied is well behaved, however, F(x_i) will be close to x_i for all x_i, and it will be reasonable to use x_i instead of the unknown F_i in (A5). This will yield approximate GLS estimates. If the test statistic is not so well behaved, it is natural to use a two-stage procedure. In the first stage, the approximate GLS estimates are obtained. In the second stage, the unknown F_i in (A5) are replaced by F̄_i = x_i + z̄_i, where z̄_i denotes the fitted values from the approximate GLS procedure. Note that the estimates of both F_i and F_i − F_{i−1} must be positive for the GLS procedure to work. Since these two conditions may not always be satisfied by the F̄_i, it may be necessary


to modify them slightly before computing the feasible GLS estimates and the final estimates F̃_i.
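The approximate GLS procedure implied by (A3)–(A5), with x_i used in place of the unknown F_i, can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def omega_inverse(F, N):
    """Tridiagonal inverse (A5) of the covariance matrix Omega, whose
    (i, j) element is (min(F_i, F_j) - F_i F_j) / N, with F_0 = 0 and
    F_{m+1} = 1 appended as in the text."""
    m = len(F)
    d = np.diff(np.concatenate([[0.0], F, [1.0]]))  # F_{i+1} - F_i, i = 0..m
    Oinv = np.zeros((m, m))
    for i in range(m):                              # 0-based counterpart of (A5)
        Oinv[i, i] = N / d[i] + N / d[i + 1]
        if i + 1 < m:
            Oinv[i, i + 1] = Oinv[i + 1, i] = -N / d[i + 1]
    return Oinv

def gls_smooth(x, v, Z, N):
    """Approximate GLS (A3): use x_i in place of the unknown F_i.
    Returns the estimates, the smoothed discrepancies, and the
    covariance matrix (A4) of the fitted values."""
    Oinv = omega_inverse(np.asarray(x), N)
    A = Z.T @ Oinv @ Z
    gamma = np.linalg.solve(A, Z.T @ Oinv @ v)
    return gamma, Z @ gamma, Z @ np.linalg.inv(A) @ Z.T
```

Because Ω⁻¹ is tridiagonal, a production implementation could avoid forming it as a dense matrix, in the spirit of (A6); the dense version above is kept for clarity.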

The log-likelihood function associated with GLS estimation of (A2) is

    −(m/2) log 2π + (1/2) log|Ω⁻¹| − (1/2)(v − Zγ)′Ω⁻¹(v − Zγ)    (A7)

The determinant of Ω⁻¹ which appears in (A7) is

    N^m (∏_{i=0}^{m} a_i)(Σ_{j=0}^{m} 1/a_j)    (A8)

where a_i = (F_{i+1} − F_i)⁻¹ for i = 0, ..., m, and, as before, F_0 = 0 and F_{m+1} = 1.

It is important to make sure that (16) fits satisfactorily, as it may not if Z has been chosen poorly. One simple approach is to calculate the GLS equivalent of the regression standard error

    s = [(1/(m − L)) ũ′Ω⁻¹ũ]^{1/2}

where ũ is the vector of (feasible) GLS residuals from estimation of (16). If Z has been specified correctly, s should be approximately equal to unity.

References

Beran, R. (1988). 'Prepivoting Test Statistics: a Bootstrap View of Asymptotic Refinements', Journal of the American Statistical Association, Vol. 83, No. 403, pp. 687–697.

Chesher, A. (1983). 'The Information Matrix Test: Simplified Calculation via a Score Test Interpretation', Economics Letters, Vol. 13, No. 1, pp. 45–48.

Chesher, A. and Smith, R. J. (1997). 'Likelihood Ratio Specification Tests', Econometrica, Vol. 65, No. 3, pp. 627–646.

Chesher, A. and Spady, R. (1991). 'Asymptotic Expansions of the Information Matrix Test Statistic', Econometrica, Vol. 59, No. 3, pp. 787–815.

Cribari-Neto, F. (1997). 'On Corrections to Information Matrix Tests', Econometric Reviews, Vol. 16, No. 1, pp. 39–54.

Cribari-Neto, F. and Cordeiro, G. M. (1996). 'On Bartlett and Bartlett-type Corrections', Econometric Reviews, Vol. 15, No. 4, pp. 339–367.

Davidson, R. and MacKinnon, J. G. (1984). 'Model Specification Tests Based on Artificial Linear Regressions', International Economic Review, Vol. 25, No. 2, pp. 485–502.

Davidson, R. and MacKinnon, J. G. (1992a). 'A New Form of the Information Matrix Test', Econometrica, Vol. 60, No. 1, pp. 145–157.

Davidson, R. and MacKinnon, J. G. (1992b). 'Regression-based Methods for Using Control Variates in Monte Carlo Experiments', Journal of Econometrics, Vol. 54, No. 1–3, pp. 203–222.

Davidson, R. and MacKinnon, J. G. (1993). Estimation and Inference in Econometrics, New York, Oxford University Press.

Davidson, R. and MacKinnon, J. G. (1996a). 'The Size Distortion of Bootstrap Tests', GREQAM Document de Travail 96A15 and Queen's Institute for Economic Research Discussion Paper 936.

Davidson, R. and MacKinnon, J. G. (1996b). 'The Power of Bootstrap Tests', Queen's Institute for Economic Research Discussion Paper 937.

Fisher, N. J., Mammen, E. and Marron, J. S. (1994). 'Testing for Multimodality', Computational Statistics and Data Analysis, Vol. 18, No. 5, pp. 499–512.

Hall, A. (1987). 'The Information Matrix Test for the Linear Model', Review of Economic Studies, Vol. 54, No. 2, pp. 257–263.

Hansen, L. P., Heaton, J. and Yaron, A. (1996). 'Finite-sample Properties of Some Alternative GMM Estimators', Journal of Business and Economic Statistics, Vol. 14, No. 3, pp. 262–280.

Hendry, D. F. (1984). 'Monte Carlo Experimentation in Econometrics', in Z. Griliches and M. D. Intriligator (eds), Handbook of Econometrics, Vol. II, Amsterdam, North-Holland, ch. 16.

Horowitz, J. L. (1994). 'Bootstrap-based Critical Values for the Information Matrix Test', Journal of Econometrics, Vol. 61, No. 2, pp. 395–411.

Horowitz, J. L. (1995). 'Bootstrap Methods in Econometrics: Theory and Numerical Performance', paper presented at the 7th World Congress of the Econometric Society, Tokyo.

Lancaster, T. (1984). 'The Covariance Matrix of the Information Matrix Test', Econometrica, Vol. 52, No. 4, pp. 1051–1053.

Mammen, E. (1992). When Does Bootstrap Work?, New York, Springer.

Orme, C. (1988). 'The Calculation of the Information Matrix Test for Binary Data Models', The Manchester School, Vol. 56, No. 4, pp. 370–376.

Taylor, L. W. (1987). 'The Size Bias of White's Information Matrix Test', Economics Letters, Vol. 24, No. 1, pp. 63–67.

West, K. D. and Wilcox, D. W. (1996). 'A Comparison of Alternative Instrumental Variables Estimators of a Dynamic Linear Regression Model', Journal of Business and Economic Statistics, Vol. 14, No. 3, pp. 281–293.

White, H. (1982). 'Maximum Likelihood Estimation of Misspecified Models', Econometrica, Vol. 50, No. 1, pp. 1–26.

Wilk, M. B. and Gnanadesikan, R. (1968). 'Probability Plotting Methods for the Analysis of Data', Biometrika, Vol. 55, No. 1, pp. 1–17.
