
165 Part 2 / Basic Tools of Research: Sampling, Measurement, Distributions, and Descriptive Statistics

Chapter 12

Testing Hypotheses

The previous chapters have provided us with the basic tools and strategies needed to test hypotheses. In this chapter, we will combine the descriptive statistics for variables and for relationships between variables with their sampling distributions to test hypotheses. This is the point at which our “verbal world” of theory meets the “measurement world” of empirical verification.

The process of hypothesis testing as it is outlined below is a general one, and one that underlies virtually all the methods for analyzing data that have been introduced in the previous chapters, as well as ones that will be encountered in much greater detail in later chapters in this book. All these procedures have the same basic idea — the comparison of one outcome to an alternative outcome, stated as a probability.

As long as we are sampling from a population, we can never be absolutely sure that our results are not due to random error, so all true/false statements are really “probably true/probably false” statements. Computing the probability of truth and using this value to decide between competing hypotheses is the process that we’ll focus on in this chapter.

It is very important to have a clear understanding of the different types of hypotheses before proceeding. These were outlined in Chapters 3 and 4. You should be clear about the differences between comparative and relationship hypotheses; directional and nondirectional hypotheses; and alternative or research hypotheses versus null hypotheses. Table 12-1 outlines the basic kinds of hypotheses, as described in Chapter 3.

Notice that all hypotheses compare one situation to another. In relationship hypotheses, covariances for the research hypothesis are compared to covariances in the opposite direction (non-null competing hypothesis) or zero covariance (null hypothesis). In comparative hypotheses, a descriptive measure (such as the mean or the proportion) for one group is compared to the same descriptive measure for another group. The competing hypotheses here are differences in the opposite direction (non-null competing hypothesis) or zero differences (null hypothesis).

However, since relationship and comparative hypotheses differ only in the level of measurement of the independent variable, we can treat both in the same way. In fact, we’ll begin referring to either as “research hypotheses”. When you see this term, recognize that the hypothesis being discussed can be either a comparative or a relationship hypothesis, and it really doesn’t matter whether we are talking about differences between the means of two groups, or the difference of a correlation coefficient from zero.

We can similarly simplify the discussion of competing hypotheses. The competing hypothesis can either be a null hypothesis (no covariance, or no difference between groups) or a non-null hypothesis which states that the direction of the covariance or of the group difference is the opposite of that predicted by the research hypothesis. From the standpoint of the procedures which follow, this distinction is not relevant. General hypothesis testing involves testing the value of a statistic against a competing value, and it really doesn’t matter if the competing value is zero (null) or something else.

In most discussions of hypothesis testing the null hypothesis is used as the competing explanation to the research hypothesis. As we saw in the previous chapter, this is a natural comparison, since the null hypothesis describes a situation in which there is no relationship between the independent and the dependent or criterion variable, while the research hypothesis describes the competing situation in which there is a relationship. We’ll also use the null hypothesis as the primary competing hypothesis, but again urge the reader to keep in mind that any other competing hypothesis value could also be used. The logic for testing the null competing hypothesis (we’ll start shortening this to “null hypothesis”) is quite simple and can be easily illustrated in an example.

Testing a Comparative Hypothesis

In this example, we want to establish whether a relationship exists between two variables. The independent variable is the attempt to persuade through a communication campaign, which has two levels: absent or present. The test of the effect of the persuasion attempt will consist of contrasting the donations of two groups: the donations of the whole work force (those who donated prior to the letter writing campaign, in the absence of a communication campaign) and the donations of the sample (those who received the boss’ letter and were possibly affected by the communication). The amount of donation is the dependent variable, and it is measured at the ratio level. Since we will be contrasting the average donations of the two groups (those who received the communication and those who did not), we will use a comparative hypothesis.

The rationale given above for the effectiveness of the letter implies that we need a nondirectional research hypothesis, since we don’t know whether the letter writing campaign, if it is effective, will increase or decrease donations:

HR: Mbefore letter ≠ Mafter letter

If the letter is irrelevant, we have the following null hypothesis:

H0: Mbefore letter = Mafter letter

Another way of stating the null hypothesis is this: if the letter writing campaign does not affect donations, then the average donations of employees who have received the letter will be the same as the average donation prior to the letter.

This statement has a very important implication. If the null hypothesis is true, then we can expect that the group which received the encouraging letter (the sample of 50 employees) is the same in all respects as those who did not receive a letter. This implies that the population parameters (the mean and variance of donations) are identical to those of the sample. Testing the null hypothesis is thus a fairly simple procedure, because we know what descriptive values we should find in the sample, if the null hypothesis is correct.

Exhibit 12-1 Testing a Nondirectional Comparative Hypothesis

The owner of a medium-sized corporation receives a request from a local charitable giving fund coordinator to persuade her employees to increase their contributions to this fund. Through an inspection of payroll records she determines that her employees make donations to this fund that amount to, on the average, $150 per year, and that these donations have a variance equal to $3600. Before writing a personal letter to all employees she consults an Employee Communications consultant who warns her that the consequences of such a letter might be twofold: the letter can either increase contributions, as employees may feel flattered by hearing from the boss; or contributions may decrease, as it is possible that employees might see the letter as an intrusion into what they consider a private matter. Given the uncertainty regarding the outcome, the owner holds off sending a letter to all employees and decides instead to test for the effect of the letter by writing to a randomly selected sample of 50 employees and to see how their contributions change. A reasonable time after sending the letter, the payroll records of the random sample will be inspected and this data will be used to determine the specific effect, if any, of the campaign.

The Null Hypothesis and the Sampling Distribution of Means

Assume then that the null hypothesis is correct. If the sample of 50 employees is representative of the population (and it should be, if we have drawn an unbiased probability sample), then we know that we should find a sample mean of $150 and a sample variance of 3600. We can further state that the mean of this sample will be part of a sampling distribution of means which would be obtained if we were to take all random samples of N = 50 out of this population. Such a sampling distribution can be completely described by the application of the Central Limit Theorem, which generates the following statements:

• The mean of the sampling distribution (X̄) will be equal to M, the mean of the population, or $150.

• The variance of the sampling distribution will be equal to σ²/N, where σ² is the population variance ($3600) and N is the sample size (50). Hence, the sampling variance will be equal to $3600/50 = $72.00.

• The standard error will then equal the square root of the sampling variance: √($3600/50) = √72 ≈ $8.49.

• The sampling distribution of means will tend toward a normal distribution if N is large.
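The arithmetic in these statements can be verified in a few lines; a minimal sketch (Python is used here purely for illustration):

```python
import math

# Values from the example: population mean $150, variance $3600, N = 50
pop_mean = 150.0
pop_var = 3600.0
n = 50

sampling_var = pop_var / n           # variance of the sampling distribution
std_error = math.sqrt(sampling_var)  # standard error of the mean

print(sampling_var)          # 72.0
print(round(std_error, 2))   # 8.49
```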

The sampling distribution that would result from taking all samples of N = 50 from this population is shown in Figure 12-1. Remember that a sampling distribution of means provides us with two pieces of information. First, the sampling distribution of means is a distribution of sampling error. It shows us the size of the discrepancies between statistics and parameters that we should expect to observe, if we take samples of a given size out of a population with known parameters.


Secondly, since we can “superimpose” the standard normal distribution, the sampling distribution of means also becomes a distribution of the probabilities of observing any particular size of sampling error. The sampling distribution shows the probabilities associated with the various sample means, or, to put it another way, shows how likely is the occurrence of various amounts of sampling error.

The sampling distribution shown in Figure 12-1 indicates that if we were to take all possible samples of N = 50 out of this population, we would observe that 68.26% of all the sample means would be between $141.52 and $158.48. Furthermore, our chances of observing any sample mean of less than $133.04 would be .0227, the same as our chances of observing any sample mean over $166.96.
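These probabilities can be recovered directly from the normal curve; a quick check using Python’s standard library:

```python
from statistics import NormalDist

se = (3600 / 50) ** 0.5              # standard error, about $8.49
dist = NormalDist(mu=150, sigma=se)  # sampling distribution of means under H0

# Area within one standard error of the population mean
p_middle = dist.cdf(150 + se) - dist.cdf(150 - se)

# Area below $133.04, about two standard errors under the mean
p_low = dist.cdf(133.04)

print(round(p_middle, 4))  # 0.6827
print(round(p_low, 4))     # 0.0228
```

The tiny differences from the chapter’s .6826 and .0227 come only from rounding the standard error to two decimals.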

If the null hypothesis is true, and if we could sample from the population without incurring sampling error, we would find the sample mean exactly equal to the parameter, or $150. However, sampling error will occur. The intervention of purely random events means that we should reasonably expect some difference between the parameter and the statistic, even if the null hypothesis is true. However, we expect the value of the statistic to be fairly close to the value of the parameter.

As Figure 12-1 shows, we have roughly a two-out-of-three chance (68.26% probability, to be exact) of observing a sample mean between $141.52 and $158.48, when the true population mean is $150.00. If we observe a sample mean within this range, that is, one which is reasonably close to the center of the sampling distribution associated with the null hypothesis, we can safely assume that the null hypothesis is probably correct, and that the difference between the population parameter and the sample mean is probably due to random error.


The Null Hypothesis and the Region of Rejection

The last sentence of the previous section contains a key idea: as long as we observe a negligible difference between the mean donation made by the receivers of the letter and the population mean observed before the letter writing campaign began, we can remain confident that the H0 is probably true. But what happens to our confidence in the null hypothesis as the difference between the sample mean and the population mean increases? As the sample mean deviates further from the center of the sampling distribution, its probability of occurrence under the null hypothesis is lower. It becomes increasingly unlikely that it occurred simply by random chance. Our confidence in the truth of the null hypothesis will wane accordingly.

If the mean of the sample of employees was found to be located in one of the extreme tails of the distribution, we would have very little confidence in the H0, because if the null hypothesis was correct, we should have observed a sample mean which was much closer to the expected value at the center of the sampling distribution.

Just as we have less confidence in the truth of the null hypothesis when the sample mean deviates widely from the center of the sampling distribution, we simultaneously have increasing confidence that the nondirectional research hypothesis is probably correct, since only one of the two alternatives can be right.

Figure 12-2 illustrates how our confidence in either hypothesis varies as a consequence of the location of the sample mean. As we move further into the tail of the sampling distribution for the null hypothesis, we will eventually arrive at some point where the degree of confidence we have in the null hypothesis simply is not sufficient to consider that hypothesis as a reasonable explanation of the actual situation. At that point we will “fail to accept the H0”, leaving us no alternative but to consider the HR (the research hypothesis) as probably true.

The problem lies in determining when we have reached the point where we no longer feel confident that the null hypothesis is true. To make this determination we must convert the continuously varying probabilities into a true/false dichotomy. This is akin to converting varying shades of gray into “black” and “white” by using a criterion paint chip for comparison: darker than the chip is “black”, lighter than the chip is “white”. The decision to reject (or fail to accept) the null hypothesis is made by employing a similar criterion. The application of this decision rule involves setting up a region of rejection in the sampling distribution. To do this, we set a critical value of the statistic located at some distance from the center of the sampling distribution. Statistics which are closer to the center of the distribution than the critical value are considered to be close enough to the value predicted by the null hypothesis that we cannot reject H0. Values further from the center than the critical value are considered improbable enough for us to reject H0, and accept the research hypothesis.

But where in the sampling distribution should we place the critical value? Remember that values of the test statistic that are representative of a true null hypothesis will be observed at or near the center of the sampling distribution. Those sample means that are observed further away (in the tails of the distribution) are considered less representative of null conditions. Accordingly, the region of rejection (the range of means in which we can reject H0) will be located in one or both tails of the sampling distribution, depending on whether the research hypothesis is directional or nondirectional. If the sample statistic that we observe is within the region of rejection, we will reject the null hypothesis. Observing a value for the test statistic not in the region of rejection will allow us to accept the null.

The Size of the Region of Rejection

The size of the region of rejection is determined by the amount of confidence that we wish to place in our decision to accept or reject the null hypothesis. We saw in Chapter 10 how to translate an area under the normal curve that describes the sampling distribution into a probability. As we saw there, this probability can be interpreted as a sampling error probability. Stating such an error probability, such as .05 (5%) or .01 (1%), is also referred to as setting the level of significance. The probability level is commonly abbreviated to p and is also referred to as the alpha level or α for testing the null hypothesis.

When we set the level of significance at p = .05, we set aside an area under the tail or tails of the sampling distribution which contains the most extreme 5% of the values of the test statistic. This area is 5% of the total area under the normal curve, and the test statistics falling in this area or areas together constitute .05 (or 5%) of all possible sample statistics in the sampling distribution. We then know that there is only a 5% chance that a sample statistic will fall in this area due to sampling error, when the actual population value is the one predicted by the null hypothesis. If our sample statistic then falls in the region of rejection, we can be confident that there is only a 5% chance of the null hypothesis being true, and a 95% chance that the research hypothesis is true.

As we’ll see below, when we set the level of significance, we automatically set the critical value of the test statistic, since only one value (or two, if the hypothesis is nondirectional, and the region of rejection lies under both tails of the sampling distribution) of the test statistic will delimit 5% of the area under the normal curve.

The size of the region of rejection is selected with an eye on the consequences of our decision about whether to accept or reject the null hypothesis. These consequences will be discussed below when we talk about the various errors that can be made in testing hypotheses.

Where to Locate the Region of Rejection: Directional and Non-directional Hypotheses

Whether or not the alternative hypothesis is directional (and if so, in which direction) is critical in determining the location of the region of rejection. Returning to the communication campaign example presented in Exhibit 12-1, assume that we decide to test the H0 at p = .10. We will then set aside a region of rejection equal to 10% of the area under both tails of the sampling distribution. Figure 12-3 shows this graphically.

The nondirectional research hypothesis requires that the region of rejection be located in both tails of the distribution, since we are not predicting whether the effect of the communication will be to increase or to decrease donations. Therefore the total error probability (.10) is divided over both tails, with an area equal to .05 in each tail. In the table of normal cumulative probabilities in Appendix A, we find that .05 of the area under the curve remains in the tail of a distribution when the standard score z equals +1.65 or -1.65. Another way of stating this is to say that between z = -1.65 and z = +1.65 lies 90% of the total area under the curve.
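The critical values can also be obtained programmatically rather than from the table in Appendix A; a small sketch:

```python
from statistics import NormalDist

alpha = 0.10
std_normal = NormalDist()  # standard normal distribution

# Nondirectional test: alpha is split over both tails
z_two_tailed = std_normal.inv_cdf(1 - alpha / 2)
print(round(z_two_tailed, 3))  # 1.645 -- the table's 1.65, less rounded

# Directional test at the same level: all of alpha in one tail
z_one_tailed = std_normal.inv_cdf(1 - alpha)
print(round(z_one_tailed, 3))  # 1.282
```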

If our sample mean is in the region of rejection, it will have to be at least 1.65 standard errors (in either direction) away from the mean of the sampling distribution, which is centered around the expected null value. That means that the critical value of z in this case is ±1.65.

Whether a sample mean in fact falls in the region of rejection can be determined by computing a z-statistic. For the difference between a sample mean and a population mean, the following formula is used:

z = (X̄ − M) / σX̄

where X̄ is the observed sample mean, M is the population mean, and σX̄ is the standard error of the sampling distribution of means.

Notice that this formula is the exact analog of the standard score, or z-score, which we have been using. The z-statistic represents the deviation of a statistic (the sample mean) from the mean of that set of statistics (the population or grand mean) under a normal curve, in the sampling distribution of statistics. The z-score represents the deviation of a data value, also under a normal curve, in the sample distribution. A similar deviation statistic can be computed for any test statistic, as we’ll see below when we test a relationship hypothesis.

The computed value of the z-statistic is evaluated against the critical value to see if it meets or exceeds the minimal value required to be in the region of rejection. Let’s see how this operates within our example:

Three months after the letters are mailed, the payroll records for the employees in the random sample are checked. The average payroll deduction, prorated for a year, is found to be $170.00.

The computed z-statistic is then equal to:

z = ($170.00 − $150.00) / $8.49 = +2.36

As zobserved (+2.36) exceeds zcritical (+1.65), the observed sample mean is within the region of rejection. Consequently, we will reject the null hypothesis which states that the encouragement to donate will have no effect. Rejecting this hypothesis then implies its logical alternative — that encouragement did have an effect. More specifically, by inspecting the value of the mean of the sample, we conclude that its effect was to increase contributions to charity.
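The whole decision, from sample mean to verdict, can be sketched in a few lines using the values from the example:

```python
import math
from statistics import NormalDist

# Values from the example
pop_mean, pop_var, n = 150.0, 3600.0, 50
sample_mean = 170.0

std_error = math.sqrt(pop_var / n)                    # about 8.49
z_observed = (sample_mean - pop_mean) / std_error
print(round(z_observed, 2))  # 2.36

# Nondirectional test at p = .10: reject H0 if |z| meets the critical value
z_critical = NormalDist().inv_cdf(1 - 0.10 / 2)       # about 1.645
print(abs(z_observed) >= z_critical)  # True -> reject the null hypothesis
```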

But suppose we have a more specific theoretical linkage which states a direction to the effect which we expect to find if the research hypothesis is true. A directional research hypothesis implies that the region of rejection is located in only one of the two tails of the distribution, namely the tail which is in the direction stated by the research hypothesis.

Let us return to our example. Assume that the theoretical linkage we develop states that employees will be honored to have this donation request directed at them by their boss, and therefore they will increase their payroll deductions. The HR will then be:

HR: Mbefore letter < Mafter letter

In other words, based on our theoretical linkage we expect that the observed mean of the sample will be GREATER than the population value measured before the communication campaign began. Logically associated with this research hypothesis is the H0:

H0: Mbefore letter ≥ Mafter letter


The research hypothesis states that the mean of the sample will be GREATER than the population mean. This HR then dictates that we locate the region of rejection (equal to .10 of the area under the normal curve) in the right-hand tail of the sampling distribution, which contains all of the sample means which are greater than the population mean M. Note that the H0 is really two competing hypotheses in this case: (1) the population mean is equal to the sample mean; and (2) the population mean is greater than the sample mean. But since the truth of either implies that we have found no evidence for our research hypothesis, we will follow the standard practice of referring (somewhat confusingly) to both competing hypotheses as the null hypothesis. Figure 12-4 shows this hypothesis.

A quick reference to the table in Appendix A indicates that the sample means in the region of rejection equal to .10 of the area under the curve are at least +1.28 standard errors above the mean of the sampling distribution. In other words, the critical value of z for the directional hypothesis has only one value, +1.28. The observed sample mean of $170.00, with an associated z-statistic of +2.36, is larger than this critical value, and is located in the region of rejection. Again, we can reject the null hypothesis.

A different theoretical linkage might produce a different alternative research hypothesis, however. Suppose our theory leads us to the conclusion that writing a letter to employees will be viewed by the recipients as an invasion of privacy and this will result in reduced donations. The research hypothesis would then be:

HR: Mbefore letter > Mafter letter

Based on this different theoretical linkage we would expect the observed mean of the sample to be LESS than M. Logically associated with this research hypothesis is the following H0:


H0: Mbefore letter ≤ Mafter letter

Figure 12-5 illustrates the decision rule: we will reject the H0 only if the observed sample mean is located in the region which makes up the extreme 10% of all sample means in the left-hand tail of the sampling distribution. In this case the critical value of the z-statistic is -1.28, or 1.28 standard errors below the mean of the sampling distribution. An observed sample mean of $170, with its associated z-statistic of +2.36, will not be located in the region of rejection. We will therefore fail to reject the null hypothesis which states that the sample mean will be equal to or greater than the population mean. Although the sample mean is a substantial distance away from the mean of the sampling distribution, it is in the wrong direction, so it does not support our research hypothesis.
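The sign of the z-statistic is what decides this case; a brief sketch of the one-tailed rule with the region of rejection in the left tail:

```python
import math
from statistics import NormalDist

# Same data as before, but HR now predicts a DECREASE in donations
z_observed = (170.0 - 150.0) / math.sqrt(3600.0 / 50)  # about +2.36

# The region of rejection is the left tail only, at p = .10
z_critical = NormalDist().inv_cdf(0.10)  # about -1.28
print(z_observed <= z_critical)  # False -> fail to reject H0
```

A large z in the wrong direction, as here, can never land in a left-tail region of rejection.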

Testing a Relationship Hypothesis

We can use the same ideas developed above to test relationship hypotheses. We’ll use an example from the previous chapter to illustrate how to test the significance of a correlation coefficient. In this example, completely described in Exhibit 11-6 in the previous chapter, we are looking at a possible relationship within organizations between two variables: “Degree of Change” and “Communicated Formalization”. The variable called “Degree of Change” is defined by the number of new products that have been introduced by the organization in the past year. The variable “Communicated Formalization” has been defined as the percentage of positions in the organization for which written job descriptions have been made available. The hypothetical data set and some necessary computations are shown in Table 12-2.

Because the level of measurement is interval for both the independent and the dependent variable, the Pearson Product Moment Correlation Coefficient will be used to compute the degree of relationship between the two variables in this sample. This correlation coefficient will be equal to +1.00 in case of a perfect positive relationship, -1.00 if there is a perfect negative relationship, and will equal 0 when there is no relationship between the two variables.

Based on the theoretical linkage that the greater the change occurring in an organization, the less sense it makes to define jobs (as they are bound to undergo change), the following research hypothesis is proposed:

HR: the correlation between # New Products and Positions with Job Descriptions (%) is negative

along with its associated H0:

H0: the correlation between # New Products and Positions with Job Descriptions (%) is zero or positive

The Null Hypothesis of No Relationship and the Sampling Distribution of the Pearson Correlation Coefficient

If the null hypothesis of no relationship is true, then the correlation coefficient between the number of new products and the percent of positions with written job descriptions should be equal to 0.0. Furthermore, the correlation coefficient observed in this sample should be part of a sampling distribution of correlation coefficients which would be obtained if we were to take all samples of this size N out of the population of organizations. That sampling distribution is shown in Figure 12-6:


The sampling distribution shows the various correlation coefficients that could be obtained if H0 is true, when samples of a given size N are taken. If the null hypothesis is in fact the true hypothesis, we would expect to observe a correlation coefficient of 0.0, or very close to 0.0, in our sample. If we do observe a correlation coefficient that is close to zero, we will consider the H0 to be probably true. If the correlation differs a great deal from zero, we will consider H0 to be probably false.

Because the HR predicts that a negative relationship between the two variables will be observed, we have a directional hypothesis, and the region of rejection will be established in the left-hand tail of the sampling distribution. For this example we have selected a region of rejection equal to .05 (or 5%) of the area under the curve. The table showing the areas under the normal distribution shows us that .05 of the area under the curve will remain in the tail of the distribution when we are 1.65 standard errors below the mean of the distribution.

To determine whether or not the null hypothesis can be rejected requires two steps. The first step is the computation of the Pearson Correlation Coefficient, which describes the size of the relationship between the two variables. (We’ll expand further on the correlation coefficient in Chapter 19.) The second step is to find out how far, in standard errors, this correlation deviates from the zero value predicted by the hypothesis of no relationship.

The actual computation of the Pearson Product-Moment Correlation Coefficient takes place by means of a rather cumbersome-looking formula:

r = (NΣXY − ΣXΣY) / √[(NΣX² − (ΣX)²)(NΣY² − (ΣY)²)]
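The computational formula translates directly into code. The sketch below implements it as written; the data values used to check it are hypothetical stand-ins, since Table 12-2 is not reproduced here:

```python
import math

def pearson_r(x, y):
    # r = (N*SXY - SX*SY) / sqrt((N*SXX - SX^2) * (N*SYY - SY^2))
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator

# Hypothetical check: a perfect positive relationship yields r = +1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```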

As you can see in Table 12-2, the obtained Pearson Correlation Coefficient between the two variables is -.44. This describes the size of the relationship. But is this far enough from the value predicted by the null hypothesis that we can reject the null? To answer this question, we will determine whether the observed correlation coefficient is located in the region of rejection. This means that it must be at least -1.65 standard errors away from the mean of the sampling distribution.

To determine this we will execute the next step: compute a z-statistic. If we assume the sample size to be sufficiently large (see the Note in Table 12-2), the formula for z is:

z = (r - 0) / σr

In this formula the numerator (r - 0) represents "how far" the observed correlation coefficient is away from 0, the expectation under the hypothesis of no relationship, and

σr = 1 / √N

is the standard error of the sampling distribution of correlation coefficients. The formula for the standard error tells us that the larger the number of observations, the smaller will be the standard error. If N = 25, the standard error would be 1/5 = .20; if N = 100, the standard error would be 1/10 = .10, etc.

The computed z-statistic for our example is then:

z = (-.44 - 0) / (1/√8) = -.44 / .354 = -1.244

As z-observed (-1.244) > z-critical (-1.65), the computed correlation coefficient is NOT within the region of rejection. Consequently, we will accept the null hypothesis that states that there is no relationship between the variables as the probably true hypothesis. We will conclude that there is no systematic parallel change in these two variables.
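The whole test can be sketched in a few lines of Python. This follows the chapter's large-sample standard error of 1/√N, with -1.65 as the one-tailed .05 cutoff from the normal table:

```python
import math

r = -0.44            # observed Pearson correlation (Table 12-2)
n = 8                # number of organizations in the sample
z_critical = -1.65   # directional .05 cutoff, left tail

std_error = 1 / math.sqrt(n)       # standard error of r under H0
z_observed = (r - 0) / std_error   # distance from 0, in standard errors

# Reject H0 only if z_observed falls in the left-tail region of rejection
reject_h0 = z_observed <= z_critical
print(round(z_observed, 2), reject_h0)  # -1.24 False
```

Since -1.24 does not reach the -1.65 cutoff, the test fails to reject the null hypothesis, exactly as in the text.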

Errors and Power in Hypothesis Testing

In the information campaign example we tested the null hypothesis that the letter writing campaign had no effect. The statistical test produced a z-statistic of +2.36, which, when contrasted with a critical value of z, allowed us to reject the nondirectional null hypothesis. We concluded that the null hypothesis is probably false, and that the research hypothesis is probably true. It needs to be emphasized here that we speak in terms of "probably true" and "probably false", because we can never be completely sure that the null hypothesis in fact is not the true hypothesis.

The reason we are never sure is because in contrasting the null hypothesis and the research hypothesis, it is possible to make two kinds of errors, along with two correct inferences. These four situations arise out of the decisions the researcher makes about the null hypothesis (to reject the H0 versus failure to reject), combined with the true situation which exists within the population (the null hypothesis is actually true or the null hypothesis is actually false). These combinations are summarized in Table 12-3.

Type 1 Error and the Correct Conclusion of No Relationship

The probability of correctly accepting or rejecting the null hypothesis is based on the area under the normal curve that is associated with the sampling distribution for the null hypothesis. This curve is centered around the expected value for the null hypothesis, and its "width" (actually the dispersion or standard error of the sampling distribution) is determined by the number of observations and the dispersion of the original data (see Chapter 9). This curve is on the left in Figure 12-7.

If the researcher rejects the H0 when in fact it is the true hypothesis, a Type 1 error is committed. The researcher will erroneously conclude that the research hypothesis is probably true, when actually there is no relationship between the independent and dependent variables.

The probability of this error occurring is determined by the size of the region of rejection, as we can see in Figure 12-7. This probability is stated by the level of significance (e.g., p = .05). The level of significance is just another way of stating the probability of making a Type 1 error.

If we find a value of the sample statistic which does indeed fall into the region of rejection, we consider that its occurrence by chance is highly unlikely, and hence the sample statistic is indicating the probable falsehood of the null hypothesis. However, it is very important that we remember that even though the sample result was very unlikely under the null hypothesis, it is not impossible to obtain such results by pure chance happenstance. Sample statistics in the region of rejection have a finite probability of occurrence, even when the null hypothesis is true. This probability, which is equal to the area under the normal curve within the region of rejection, is the probability of committing a Type 1 error. The larger the region of rejection, the more likely it is that we will reject the null hypothesis, and thus the more likely we are to commit a Type 1 error. This can be seen by contrasting the α error in Figure 12-8 with that in Figure 12-7.

The farther the critical value is from the center of the sampling distribution which is centered around the null hypothesis expected value, the lower is the probability that a sample statistic of the critical magnitude (or greater) will occur by chance. So decreasing the region of rejection (e.g., from .05 to .025) makes it harder to reject the null hypothesis, and therefore decreases the risk of committing a Type 1 error. It also improves the chances of making a correct No Relationship decision. The probability of making this correct decision is (1 - α), so it increases as α decreases. However, this decision has an impact on Type 2 error, as we'll see below.
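The link between the chosen α level and the critical value can be checked directly with Python's standard library (a sketch; `NormalDist` requires Python 3.8+):

```python
from statistics import NormalDist

# Critical z for a directional (one-tailed, left-tail) test:
# the point that leaves exactly alpha of the area in the tail.
for alpha in (0.10, 0.05, 0.025, 0.01):
    z_crit = NormalDist().inv_cdf(alpha)
    print(f"alpha = {alpha:<5}  critical z = {z_crit:.3f}")
```

Shrinking α from .05 to .025 pushes the cutoff from about -1.645 to about -1.960 standard errors, which is exactly the "harder to reject" behavior described above.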

Type 2 Error and Statistical Power

The probability of correctly detecting or failing to detect a true relationship is based on the area under the normal curve that is associated with the sampling distribution for the research or alternative hypothesis. The shape of this curve (its standard error) is the same as that of the null hypothesis distribution, but it is centered at some non-null value. This curve is on the right in Figure 12-7.

When the researcher accepts the H0 when in fact she should have rejected it, a Type 2 error is committed. Here, the researcher incorrectly concludes that there is no relationship between the independent and dependent variables, when one actually exists. The researcher has erroneously accepted the null hypothesis. The probability of making this error is indicated by the area under the tail of the sampling distribution for the research hypothesis that exceeds the critical value, in the direction of the expected value for the null hypothesis, as shown in Figure 12-7.

The probability of correctly rejecting the null hypothesis is indicated by the remaining area under the non-null sampling distribution. This area (1 - β) is also sometimes called the "statistical power", since it represents the probability that a researcher will be able to correctly conclude that a relationship exists from examining his sample data.
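For a z-statistic, these two areas are easy to compute. The sketch below assumes a directional test with the region of rejection in the right tail, and expresses the research hypothesis expected value as a shift of `effect` standard errors from the null expected value (the numbers are hypothetical):

```python
from statistics import NormalDist

alpha = 0.05
effect = 2.5  # hypothetical research-hypothesis expected value,
              # in standard errors from the null expected value

z_crit = NormalDist().inv_cdf(1 - alpha)  # right-tail critical value (~ +1.645)

# Beta: the area of the non-null distribution that falls short of the cutoff
beta = NormalDist(mu=effect).cdf(z_crit)
power = 1 - beta
print(round(beta, 3), round(power, 3))
```

Power here is just the rest of the non-null curve: whatever area does not fall into the β region must fall past the critical value.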

Effect Size and the Expected Value for the Research Hypothesis

How do we determine the expected value of the alternative research hypothesis? While the expected value of the null hypothesis is usually self-evident (no difference between means; a zero correlation, etc.), there is no such easily obtained value for a research hypothesis. If we're not even sure that a relationship exists, how can we state an expected value? The answer lies in realizing that the expected value for the research hypothesis is essentially an arbitrary decision made by the researcher. It is not just a number pulled out of a hat, however. There are several motives that a researcher might have for assigning a particular expected value to the research hypothesis.

One way to set the research hypothesis expected value is to set this effect size at some minimum meaningful value. In the communication campaign example described earlier, the researcher might decide that any communication campaign that produces less than an average additional $10 per year donation is not worth the effort of implementing. Since the population mean before the campaign was $150 (the null hypothesis expected value), adding the minimum effect size of interest ($10) gives a research hypothesis expected value of $160.

Another way of setting the expected value is by examining prior research. If a number of studies have indicated that the average correlation between the number of new products produced by an organization and the number of jobs with a written description is -.35, we might set the research hypothesis expected value at -.35.

Once we have set the expected value, we can begin to talk about error in terms of effect size. The difference between the expected value for the null hypothesis and for the research hypothesis is the effect size. In essence, it summarizes the size of the difference between the situation of no relationship (the null hypothesis) and a significant relationship (the research hypothesis) that we are trying to detect.

Alpha (α) and Beta (β) Error, Effect Size, and the Number of Observations

There is an interactive relationship among the levels of α error, β error, the effect size, and the standard error of the sampling distribution. Changing any of these values has some effect on the others. A researcher who understands this interplay can adjust the values to get the maximum value from the research.

Changes in the standard error of the sampling distribution can be made by changing the number of observations (the N) in the sample. The standard error has been defined earlier within the context of the Central Limit Theorem, or

σx̄ = σ / √N

Since the dispersion of the data distribution for the population is a constant, and is not affected by any action that the researcher may take, only the N is involved in determining the dispersion of the sampling distribution.

The α and β error levels are arbitrary values set by the researcher. By tradition, the α level, or significance level, is frequently set at 5% (p = .05). Likewise, the conventional level for β is .20. There are no real reasons for these values; they've just emerged because a large number of people have used them in the past. But the intelligent researcher will set both error values according to the rewards and penalties for making each type of error, rather than deferring to tradition. An example will help to illustrate the value of considering both error types.

Suppose a researcher in health communication has devised a counseling program for recent heart attack victims. This program uses group sessions, printed brochures with advice about diet and exercise, and periodic monitoring to convince patients to modify their behavior toward more healthful practices. Two groups are randomly chosen. One will receive the communication program after discharge from the hospital, and the other will not. The long-term survival rate of patients is used as the dependent variable.


As this is an expensive communication program, the researcher will want to be sure that it works before recommending implementation of the program for all patients. This will require that a fairly low α error level be used, since the researcher will not want to falsely conclude that the program works, when it actually works no better than no intervention at all (Type 1 error). But there is a penalty for making a Type 2 error, too. If the program really works, the researcher will certainly not want to conclude that it is ineffective. So the β error level must be set reasonably low, also. As we'll see below, this will be a demanding situation.

Contrast this situation with a similar communication program which just periodically mails brochures to recent heart attack patients. This is a low-cost program, so making a Type 1 error carries a far lower economic penalty. Here, we may increase the α level, since making a Type 1 error is fairly inexpensive, while making a Type 2 error (concluding the program has no effect, when it actually works) will be very expensive, as it will fail to save lives by abandoning a working communication program.

As we discussed above, the effect size is also an arbitrary value set by the researcher. Likewise, the number of observations is determined by the researcher, but since data collection requires time and monetary resources, the researcher will usually want to keep the N at as small a value as possible. Since α, β, effect size, and number of observations interact, we can often trade off the settings of one or more of these values for the setting of another. At the end of this process, the researcher will want the optimum settings which will detect the smallest meaningful effect size with the smallest number of observations and with the smallest possible α and β error levels.

Figures 12-8, 12-9 and 12-10 show the effects of changing some of these values. Each figure should be compared to Figure 12-7, to see the difference that changing one of the values has on the others.

Figure 12-8 illustrates the effect of increasing the α level of significance, which moves the critical value closer to the null hypothesis expected value. If the α is increased, the size of the region of rejection is increased, and the probability of making a Type 1 error is increased. But the probability of making a Type 2 (β) error is decreased. This illustrates the inverse nature of Type 1 and Type 2 error. If all else remains constant, a reduction in the probability of making a Type 1 error will be associated with an increased probability of making a Type 2 error, and vice versa.
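This inverse relationship can be verified numerically. A sketch with a hypothetical effect of 2 standard errors and a directional right-tail test:

```python
from statistics import NormalDist

effect = 2.0  # hypothetical research-hypothesis expected value,
              # in standard errors from the null expected value

for alpha in (0.01, 0.05, 0.10):
    z_crit = NormalDist().inv_cdf(1 - alpha)  # cutoff moves inward as alpha grows
    beta = NormalDist(mu=effect).cdf(z_crit)  # Type 2 error shrinks as alpha grows
    print(f"alpha = {alpha:<4}  beta = {beta:.3f}")
```

As α rises from .01 to .10, β falls steadily: loosening one error tightens the other, with everything else held constant.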


Figure 12-9 illustrates the effect of increasing the N. With more observations, the standard error of the sampling distribution is reduced, and both α and β error probabilities are reduced for the same critical value. Alternatively, the α value may be fixed at the same level, and this will decrease the probability of β error even more dramatically. Increasing N is obviously a good way to reduce error, but it is also frequently the most expensive decision the researcher can make.
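The same machinery shows why N is so effective. Holding α at .05 and a hypothetical raw effect constant, increasing N shrinks the standard error, which pushes the non-null distribution further past the critical value and drives β down (a sketch with made-up numbers):

```python
import math
from statistics import NormalDist

alpha = 0.05
raw_effect = 0.4  # hypothetical effect, in units of one observation's std. deviation

z_crit = NormalDist().inv_cdf(1 - alpha)
for n in (25, 50, 100):
    std_error = 1 / math.sqrt(n)    # standard error falls as N grows
    shift = raw_effect / std_error  # the same effect, now in standard-error units
    beta = NormalDist(mu=shift).cdf(z_crit)
    print(f"N = {n:<4} beta = {beta:.3f}")
```

Quadrupling N from 25 to 100 halves the standard error, and β drops sharply even though the raw effect and the α level have not changed.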

Figure 12-10 illustrates a way to reduce β error without increasing the α error or the number of observations. But this also carries a cost. In this figure, the effect size is increased, by moving the research hypothesis expected value further from the null hypothesis value. Since the sampling distribution for the null hypothesis is not affected, the α error remains constant. But the center of the research hypothesis sampling distribution is shifted to the right, and the critical value that determines the β error level falls further out on the tail of this non-null distribution. In effect, we have improved the β error, but only for stronger relationships, i.e., those in which the sampling distribution for the research hypothesis is assumed to be centered further from the sampling distribution of the null hypothesis.

Power Analysis

Since the α, β, effect size and N are functionally related, setting the values for any three of the four will determine exactly the value of the fourth. This fact can be used in designing research and in analyzing research results. We'll illustrate two typical uses of "power analysis", which is the generic term for investigating the tradeoffs of error, effect size, and sample size.

One of the most useful ways to use power analysis is to determine the number of observations which are needed to test a hypothesis. If we collect too few observations, the standard error of the sampling distributions will be large, and we'll have large α and/or β errors for anything except very large effect sizes. If we collect too many observations, we are wasting time and money by going beyond the error levels with which we would usually feel comfortable in drawing conclusions, or by having effect sizes which are trivial.

To determine the number of observations, we must set the α level, the β level (or its complement, the Power (1 - β)), and the effect size. Once we have set these values, we can determine exactly the number of observations that we must have in our sample.

Using the health counseling example, suppose we decide that the cost of the extensive program means that we want 100:1 odds that any significant effect we detect in the sample is really present in the population, and is not really due to random sampling error. So we set the α level of significance at .01. Next, we determine that the value of saving lives with this program means that we want no more than a 5% chance of missing a real effect of the communication program, if it is present in the population. This sets the β error level at .05 and the Power at .95. Finally, we wish to see at least a 10% improvement in patient mortality as a result of the program. We will not worry about detecting improvements smaller than this with the same error levels, as they are probably not justified by the cost of the program. This sets the effect size.

By referring to standard tables of statistical power, such as those found in Cohen (1977) or Kraemer and Thiemann (1987), we can now find the N required to meet these conditions. An abbreviated power table is shown in Appendix B. As the table indicates, we will need about 315 observations in order to be confident that we will have only a 1 in 100 chance of falsely concluding that a communication effect exists when it does not; and only a 1 in 20 chance of missing a 10% or greater improvement in patient mortality.

The standard deviation of the survival rate of patients is already known from published health statistics. Assume it is 50 percent. Using the procedure in Appendix B, we look down the α = .01 column and across the Power = .95 row to get the standardized effect, z-diff = 3.971. Using the computational formula in Appendix B:

N = (z-diff × σ / effect size)² = (3.971 × 50 / 10)² ≈ 394

We will need about 394 observations in order to be confident that we will have only a 1 in 100 chance of falsely concluding that a communication effect exists when it does not; and only a 1 in 20 chance of missing a 10 percent or greater improvement in patient mortality.
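The Appendix B computation can be reproduced with the normal quantile function: the standardized effect z-diff is the sum of the one-tailed quantiles for the chosen α and Power (a sketch of that formula; exact table values may differ slightly from these computed quantiles):

```python
from statistics import NormalDist

alpha = 0.01    # Type 1 error level
power = 0.95    # 1 - beta
sigma = 50.0    # std. deviation of the survival rate, from published statistics
effect = 10.0   # smallest meaningful improvement (10 percent)

# Standardized effect: quantile for (1 - alpha) plus quantile for the power
z_diff = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)

n_required = (z_diff * sigma / effect) ** 2
print(round(z_diff, 3), round(n_required))  # 3.971 394
```

Tightening either error level, or shrinking the smallest effect we care about, raises z-diff or the σ/effect ratio and drives the required N up quickly, since N grows with the square of that product.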

Another common use of power analysis occurs when we fail to reject a null hypothesis. We might fail to reject the hypothesis for two reasons: there really isn't a relationship; or we do not have enough statistical power (1 - β) to detect the relationship with any confidence.

Looking at the organizational change and formalization example again, we found that the correlation between the number of new products and the percentage of jobs with written specifications (-.44) did not fall in the region of rejection. Possibly there is no real relationship between the two variables, and that's why we failed to find a correlation beyond the critical value. But it is also possible the relationship exists, and that the number of observations was so low that we missed getting a large enough correlation simply because of random sampling error.

The level of significance (α error) was set to .05 in that example, and the N was 8. If we set the effect size at .44 (the observed size of the relationship), we can use the power tables to find the β error probability, and the statistical power (1 - β). As we can see from the tables, the power is only about 10%, which means that our β error probability is 90%! We will incorrectly conclude that there is a null relationship, when the relationship is actually as strong as r = -.44 or stronger, about 9 times in 10! We obviously do not want to accept the null hypothesis under these conditions. We are better off withholding judgment until more observations can be collected.

Summary

The process of hypothesis testing establishes the probable truth or probable falsehood of the hypothetical statements developed during theory construction. The general logic of hypothesis testing is to contrast the predictions of our research hypothesis HR with the predictions of a competing hypothesis which states that there is no relationship (the null hypothesis H0).

To make the comparison, a sampling distribution is constructed, centered around the value of the statistic which we expect to occur if the null hypothesis is correct. Next, we set a level of significance for the hypothesis test. This level states the error probability which we are willing to accept. For example, we may set the probability at .05, meaning that we will reject the null hypothesis if our results indicate that there is only a 5% or less chance of this decision being incorrect.

This error level (called α or Type 1 error) corresponds to an area under the normal curve that represents the sampling distribution. By converting this area to a critical value, we have a criterion for comparison with our sample statistic. If the value of the sample statistic is beyond the critical value, we can reject the null hypothesis, and accept the alternative research hypothesis.

Both comparative and relationship hypotheses can be tested using this approach. Sampling distributions can be created for group difference statistics like the difference between means, or for covariance statistics like correlation coefficients. The procedure for contrasting research and null hypotheses is the same in both cases.

While the α error level, or level of significance, describes the probability of falsely concluding that a relationship exists, when there really is no relationship, β or Type 2 error describes the probability of concluding that there is no relationship, when one really exists. The probability of this error depends on the effect size, which is the distance between the expected value of the research hypothesis and the expected value of the null hypothesis, and on the number of observations.

Power analysis examines the tradeoffs between α error, β error, effect size, and the N. It can be used to determine the number of observations which are required to achieve suitably low error levels, to determine tradeoffs between Type 1 and Type 2 error, or to examine null results to find out if they were nonsignificant because of too few observations.

References and Additional Readings

Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press. (Chapter 1, "The Concepts of Power Analysis").

Cohen, J. & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences. (Chapter 2, "Bivariate correlation and regression").

Hays, W.L. (1981). Statistics (3rd ed.). New York: Holt, Rinehart & Winston. (Chapter 7, "Hypothesis Testing").


Kerlinger, F.N. (1986). Foundations of behavioral research (3rd ed.). New York: Holt, Rinehart and Winston. (Chapter 11, "Purpose, Approach, Method"; Chapter 12, "Testing Hypotheses and the Standard Error").

Keppel, G. (1982). Design and analysis: A researcher's handbook. Englewood Cliffs, NJ: Prentice-Hall. (Chapter 2, "Specifying Sources of Variability").

Kraemer, H.C. & Thiemann, S. (1987). How many subjects: Statistical power analysis in research. NewburyPark, CA: Sage.
