Statistics Basics



Lesson 1

Measures of Central Tendency: The Mean, Median, and Mode

One of the most basic purposes of statistics is simply to enable us to make sense of large numbers. For example, if you want to know how the students in your school are doing in the statewide achievement test, and somebody gives you a list of all 600 of their scores, that’s useless. This everyday problem is even more obvious and staggering when you’re dealing, let’s say, with the population data for the nation.

We’ve got to be able to consolidate and synthesize large numbers to reveal their collective characteristics and interrelationships, and transform them from an incomprehensible mass to a set of useful and enlightening indicators.

The Mean

One of the most useful and widely used techniques for doing this—one which you already know—is the average, or, as it is known in statistics, the mean. And you know how to calculate the mean: you simply add up a set of scores and divide by the number of scores. Thus we have our first and perhaps the most basic statistical formula:

X̄ = ΣX / N

Where:

X̄ (sometimes called the X-bar) is the symbol for the mean.

Σ (the Greek letter sigma) is the symbol for summation.

X is the symbol for the scores.

N is the symbol for the number of scores.

So this formula simply says you get the mean by summing up all the scores and dividing the total by the number of scores—the familiar old average, so it's a good place to begin.


This is pretty simple when you have only a few numbers. For example, if you have just 6 numbers (3, 9, 10, 8, 6, and 5), you insert them into the formula for the mean, and do the math:

X̄ = (3 + 9 + 10 + 8 + 6 + 5) / 6 = 41 / 6 ≈ 6.8

But we usually have many more numbers to deal with, so let's do a couple of examples where the numbers are larger, and show how the calculations should be done. In our first example, we're going to compute the mean salary of 36 people. Column A of Table 1 shows the salaries (ranging from $20K to $70K), and column B shows how many people earned each of the salaries.

Table 1. Example 1 of Method for Computing the Mean

A: Salary (X)   B: Frequency (f)   C: fX
$20K            1                  20
$25K            2                  50
$30K            3                  90
$35K            4                  140
$40K            5                  200
$45K            6                  270
$50K            5                  250
$55K            4                  220
$60K            3                  180
$65K            2                  130
$70K            1                  70
Sum             36                 1,620

To get the ΣfX for our formula, we multiply the number of people in each salary category by the salary for that category (e.g., 1 x 20, 2 x 25, etc.), and then total those numbers (the ones in column C). Thus we have:

X̄ = ΣfX / N = 1,620 / 36 = 45, or $45K.
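If you'd like to check this kind of calculation on a computer, here is a minimal Python sketch of the same fX method applied to the Table 1 data (the variable names are just for illustration):

```python
# Mean from a frequency table: multiply each salary by its frequency,
# total the products (column C), and divide by the number of people.
salaries = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]   # column A, in $K
freqs    = [ 1,  2,  3,  4,  5,  6,  5,  4,  3,  2,  1]   # column B

sum_fx = sum(f * x for f, x in zip(freqs, salaries))   # 1,620
n = sum(freqs)                                         # 36
mean = sum_fx / n                                      # 45.0, i.e., $45K
print(sum_fx, n, mean)
```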

And this is how the distribution of these salaries looks:


Figure 1. Distribution of Example 1 Salaries

The scores in this distribution are said to be normally distributed, i.e., clustered around a central value, with decreasing numbers of cases as you move to the extreme ends of the range. Thus the term normal curve.

So, computing the mean is pretty simple. Piece of cake, right? Not so fast.

In our second example, let's look at what happens if we change just six people's salaries in Table 1. Let's suppose that the three people who made $60K actually made $200K, and that the two who made $65K made $205K, and the one person who made $70K made $210K. The revised salary table is the same except for these changes.

Table 2. Example 2 of Method for Computing the Mean

A: Salary (X)   B: Frequency (f)   C: fX
$20K            1                  20
$25K            2                  50
$30K            3                  90
$35K            4                  140
$40K            5                  200
$45K            6                  270
$50K            5                  250
$55K            4                  220
$200K           3                  600
$205K           2                  410
$210K           1                  210
Sum             36                 2,460

But before we recompute the mean, let’s look at how different the distribution looks.

Figure 2. Distribution of Example 2 Salaries

Now, using the revised numbers in Table 2, we compute the mean as follows:

X̄ = ΣfX / N = 2,460 / 36 ≈ 68.3, or about $68.3K.

What this shows is that changing the salaries of just six individuals to extreme values greatly affects the mean. In this case, it raised the mean from $45K to $68.3K (an increase of 52%), even though all the other scores remained the same. In fact, the mean is a figure that no person in the group has—hardly a figure we would think of as "average" for the group.

The important lesson here is that the mean is intended to be a measure of central tendency, but it works usefully as such only if the data on which it is based are more or less normally distributed (like in Figure 1). The presence of extreme scores distorts the mean, and, in this case, gives us a mean salary ($68.3K) that is not a very good indication of the "average" salary of this group of 36 individuals.

So if we know or suspect that our data may have some extreme scores that would distort the mean, what measure can we use to give us a better measure of central tendency? One such measure is the median, and we move on to learn about that now.

The Median

If your data are normally distributed (like those in Figure 1), the preferred measure of central tendency is the mean. However, if your data are not normally distributed (like those in Figure 2), the median is a better measure of central tendency, for reasons we’ll see in a moment.

The median is the point in the distribution above which and below which 50% of the scores lie. In other words, if we list the scores in order from highest to lowest (or lowest to highest) and find the middle-most score, that’s the median.

For example, suppose we have the following scores: 2, 12, 4, 11, 3, 7, 10, 5, 9, 6. The next step is to array them in order from lowest to highest.

2, 3, 4, 5, 6, 7, 9, 10, 11, 12

Since we have 10 scores, and 50% of 10 is 5, we want the point above which and below which there are five scores. Careful. If you count up from the bottom, you might think the median is 6. But that's not right because there are 4 scores below 6 and 5 above it. So how do we deal with that problem? We deal with it by understanding that in statistics, a measurement or a score is regarded not as a point but as an interval ranging from half a unit below to half a unit above the value. So in this case, the actual midpoint or median of this distribution—the point above which and below which 50% of the scores lie—is 6.5.
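Python's standard library will give the same answer, by averaging the two middle scores rather than reasoning about intervals, and it is a quick way to check your hand work:

```python
import statistics

scores = [2, 12, 4, 11, 3, 7, 10, 5, 9, 6]
print(sorted(scores))              # [2, 3, 4, 5, 6, 7, 9, 10, 11, 12]
print(statistics.median(scores))   # 6.5 -- the average of the two middle scores, 6 and 7
```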

As we saw with the mean, when we have only a few numbers, it’s pretty simple. But how do we find the median when we have larger numbers and more than one person with the same score? It’s not difficult. Let’s use the salary data in Table 1.

Table 3. Example 1 of Method for Computing the Median

Salary   Range             Frequency
$20K     $19.5K-$20.5K     1
$25K     $24.5K-$25.5K     2
$30K     $29.5K-$30.5K     3
$35K     $34.5K-$35.5K     4
$40K     $39.5K-$40.5K     5
$45K     $44.5K-$45.5K     6
$50K     $49.5K-$50.5K     5
$55K     $54.5K-$55.5K     4
$60K     $59.5K-$60.5K     3
$65K     $64.5K-$65.5K     2
$70K     $69.5K-$70.5K     1
Sum                        36

The salaries are already in order from lowest to highest, so the next step in finding the median is to determine how many individuals (ratings, scores, or whatever) we have. Those are shown in the frequency column, and the total is 36. So our N = 36, and we want to find the salary point above which and below which 50%, or 18, of the individuals fall. If we count up from the bottom through the $40K level, we have 15, and we need three more. But if we include the $45K level (in which there are 6), we have 21, three more than we need. Thus, we need 3, or 50%, of the 6 cases in the $45K category. We add this fraction (.5) of the interval width ($1K) to the lower limit of the interval in which we know the median lies ($44.5K-$45.5K), and this gives us a value of $45K.


In this case, the mean and the median are the same—as they always are in normal distributions. So in situations like this, the mean is the preferred measure.

But things aren’t always so neat and tidy. Let’s now compute the median for the salary data in Table 2, which we know (from Figure 2) are not normally distributed.

Table 4. Example 2 of Method for Computing the Median

Salary   Range               Frequency
$20K     $19.5K-$20.5K       1
$25K     $24.5K-$25.5K       2
$30K     $29.5K-$30.5K       3
$35K     $34.5K-$35.5K       4
$40K     $39.5K-$40.5K       5
$45K     $44.5K-$45.5K       6
$50K     $49.5K-$50.5K       5
$55K     $54.5K-$55.5K       4
$200K    $199.5K-$200.5K     3
$205K    $204.5K-$205.5K     2
$210K    $209.5K-$210.5K     1
Sum                          36

The N is the same (36), so we go through exactly the same calculations we did for the data in Table 3. When we do that (count up from the bottom, find that we need half the cases in the $45K category to get 50% (18) of the total, and do so by adding .5 to the lower limit of that category), incredibly we get exactly the same result ($45K) we did with the data in Table 3. In other words, those six extreme cases (the six whose salaries changed from $60K, $65K, and $70K to $200K, $205K, and $210K) don’t affect the median even though they made a big change in the mean. They are still above the midpoint, and it doesn’t matter how much above it in the calculation of the median.

This example illustrates dramatically what the median is and why it’s a better measure of central tendency than the mean when we have extreme scores.

We’ve done the calculations for the median in a simple, descriptive way (arraying the scores from high to low, counting up to the mid-

Page 8: Statistics Basics

category, dividing it as necessary, etc.), but just so you won’t feel slighted, here is the statistical formula for doing what we’ve just done.

Mdn = L + ((N/2 − Σfb) / fw) × i

Where:

Mdn is the median.

L is the lower limit of the interval containing the median.

N is the total number of scores.

Σfb is the sum of the frequencies or number of scores up to (i.e., below) the interval containing the median.

fw is the frequency or number of scores within the interval containing the median.

i is the size or range of the interval.
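Here is a small Python sketch of that interpolation procedure, using the Table 3 numbers (the function name and arguments are just illustrative labels for the symbols above):

```python
def interpolated_median(lower_limit, n, freq_below, freq_within, interval_width):
    """Mdn = L + ((N/2 - sum of frequencies below) / frequency within) * interval size."""
    return lower_limit + ((n / 2 - freq_below) / freq_within) * interval_width

# Table 3: the median interval is $44.5K-$45.5K, 15 cases lie below it,
# 6 cases lie within it, and the interval is $1K wide.
print(interpolated_median(44.5, 36, 15, 6, 1.0))   # 45.0, i.e., $45K
```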

The Mode

The third and last of the measures of central tendency we'll be dealing with in this course is the mode. It's very simple: The mode is the most frequently occurring score or value. In our case (see Figures 1 and 2), that value is $45K. But sometimes we may have odd distributions in which there are two peaks. Even if the peaks are not exactly equal, they're referred to as bi-modal distributions.
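A quick way to find the mode (or modes) by computer is simply to count how often each value occurs; a minimal sketch using the Table 1 salaries:

```python
from collections import Counter

# Rebuild the 36 individual salaries from Table 1's frequencies, then count them.
salaries = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
freqs    = [ 1,  2,  3,  4,  5,  6,  5,  4,  3,  2,  1]
data = [x for x, f in zip(salaries, freqs) for _ in range(f)]

counts = Counter(data)
top = max(counts.values())
modes = [value for value, count in counts.items() if count == top]
print(modes)   # [45] -- one mode; a bi-modal distribution would show two values here
```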

Let’s assume we have such a bi-modal distribution of salaries as shown in Table 5 and Figure 3.

Table 5. Bi-Modal Distribution of Salaries

A: Salary (X)   B: Frequency (f)   C: fX
$20K            1                  20
$25K            3                  75
$30K            4                  120
$35K            6                  210
$40K            3                  120
$45K            1                  45
$50K            3                  150
$55K            5                  275
$60K            6                  360
$65K            3                  195
$70K            1                  70
Sum             36                 1,640

Figure 3. Example of a Bi-Modal Distribution

Before we talk about the mode, using the formulas and calculation procedures you've just learned, calculate the mean and median for the salaries in Table 5 (the fX data and the ΣfX total are in Column C).

When you look at this distribution of salaries, as shown graphically in Figure 3, it's hard to discern any central tendency. The mean (which you just calculated) is about $45.6K, a salary that no one in the group actually earns, and the median is about $45.5K, which, while it's the middle-most value (50% of the cases are above and below it), certainly doesn't give us a meaningful indication of the central tendency in this distribution—because there isn't any.


Therefore, the most informative general statement we can make about this distribution is to say that it is bi-modal.

You now know the three principal measures of central tendency—the mean, the median, and the mode—when they should be used, and how to calculate them, so we now move on to the other side of the central-tendency coin: dispersion.

Lesson 2

The Standard Deviation and the Normal Curve

A Measure of Dispersion: The Standard Deviation

For various important reasons we'll see as we get further into this course, we often want to know not only what the central tendency is in a set of scores or values (i.e., the mean, the median, or the mode), but also how bunched up or spread out the scores are. The most widely used indicator of dispersion is the standard deviation, which, in a nutshell, is based on the deviation of each score from the mean.

To illustrate, compare the distributions of test scores in Figures 4 and 5. The first is flat and spread out, while the second is concentrated and bunched up closely around the mean.

Figure 4. Graphic Display of a Flat or Spread-Out Score Distribution


Figure 5. Display of a Narrow or Concentrated Distribution

Note that the mean and median of these two quite different distributions are the same (X̄ = 150, Mdn = 150), so simply calculating and reporting those two measures of central tendency would fail to reveal how different the dispersion of scores is between the two groups. But we can do this by calculating the standard deviation.


The standard deviation provides us with a measure of just how spread out the scores are: a high standard deviation means the scores are widely spread; a low standard deviation means they're bunched up closely on either side of the mean.

We'll now calculate the standard deviation for both these distributions. The formula for the standard deviation is:

σ = √(Σd² / N)

Where:

σ (lowercase sigma) is the standard deviation.

d² is a score's squared deviation from the mean.

N is the number of cases.

The numbers we need to calculate the standard deviation for Figure 4, the flat distribution, are in Table 6.

Table 6. Data for Figure 4—the Flat Distribution

A: Test Score (X)   B: Frequency (f)   C: X–Mean (d)   D: fd    E: fd²
100                 8                  50              400      20,000
110                 13                 40              520      20,800
120                 17                 30              510      15,300
130                 20                 20              400      8,000
140                 21                 10              210      2,100
150                 22                 0               0        0
160                 21                 -10             -210     2,100
170                 20                 -20             -400     8,000
180                 17                 -30             -510     15,300
190                 13                 -40             -520     20,800
200                 8                  -50             -400     20,000
Sum                 180                                         132,400

Column A displays the test scores (X).

Column B shows how many people got each test score (f).


Column C is the test score minus the mean (X minus the mean or d).

Column D is the frequency multiplied by the deviation (fd).

Column E is the frequency multiplied by the squared deviation (fd²).

Of course, to get the deviation of each score from the mean (column C), we have to calculate the mean, and you already know how to do that. We now have what we need to calculate the standard deviation for the flat distribution in Figure 4:

σ = √(132,400 / 180) ≈ √736 ≈ 27

You can do the last part of this calculation, the square root of 132,400/180 (which is about 736), by using the square-root button on your little hand calculator.
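If you would rather let the computer do the arithmetic, here is a minimal Python sketch of the same calculation for the Table 6 data (variable names are just for illustration):

```python
import math

scores = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200]   # column A
freqs  = [  8,  13,  17,  20,  21,  22,  21,  20,  17,  13,   8]   # column B

mean = sum(f * x for f, x in zip(freqs, scores)) / sum(freqs)       # 150.0
sum_fd2 = sum(f * (x - mean) ** 2 for f, x in zip(freqs, scores))   # 132,400 (column E total)
sd = math.sqrt(sum_fd2 / sum(freqs))                                # sqrt(132,400 / 180), about 27.1
print(sum_fd2, round(sd, 1))
```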

Now let's compute the standard deviation for the data in Figure 5. The data are in Table 7, and you follow the same steps we've just completed.

Table 7. Example of a Narrow or Concentrated Distribution

A: Test Score (X)   B: Frequency (f)   C: X–Mean (d)   D: fd    E: fd²
100                 0                  50              0        0
110                 0                  40              0        0
120                 0                  30              0        0
130                 10                 20              200      4,000
140                 45                 10              450      4,500
150                 70                 0               0        0
160                 45                 -10             -450     4,500
170                 10                 -20             -200     4,000
180                 0                  -30             0        0
190                 0                  -40             0        0
200                 0                  -50             0        0
Sum                 180                                         17,000

σ = √(17,000 / 180) ≈ √94 ≈ 10


The two standard deviations provide a statistical indication of how different the distributions are: 27 for the spread-out distribution and 10 for the bunched-up distribution.

So once we know the mean and median, why do we need to know the standard deviation? What use is it?

The standard deviation is important because, regardless of the mean, it makes a great deal of difference whether the distribution is spread out over a broad range or bunched up closely around the mean. For example, suppose you have two classes whose mean reading scores are the same. With only that information, you would be inclined to teach the two classes in the same way. But suppose you discover that the standard deviation of one of the classes is 27 and the other is 10, as in the examples we just finished working with. That means that in the first class (the one where σ = 27), you have many students throughout the entire range of performance. You'll need to have teaching strategies for both the gifted and the challenged. But in the second class (the one where σ = 10), you don't have any gifted or challenged students. They're all average, and your teaching strategy will be entirely different.

The Normal Curve

Before we leave the standard deviation, it's a good time to learn a little more about the normal curve. We'll be coming back to it later.

First, why is it called the normal curve? The reason is that so many things in life are distributed in the shape of this curve: IQ, strength, height, weight, musical ability, resistance to disease, and so on. Not everything is normally distributed, but most things are. Thus the term normal curve.

In Figure 6, we have a set of scores which are normally distributed. The range is from 0 to 200, the mean and median are 100, and the standard deviation is 20. In a normal curve, the standard deviation indicates precisely how the scores are distributed. Note that the percentage of scores is marked off by standard deviations on either side of the mean. In the range between 80 and 120 (that's one standard deviation on either side of the mean), there are 68.26% of the cases. In other words, in a normal distribution, roughly two thirds of the scores lie within one standard deviation on either side of the mean. If we go out to two standard deviations on either side of the mean, we will include 95.44% of the scores; and if we go out three standard deviations, that will encompass 99.74% of the scores; and so on.
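These percentages can also be computed directly from the mathematics of the normal curve; here is a small Python sketch (the tiny differences from the figures above are just rounding):

```python
import math

def pct_within(k):
    """Percent of a normal distribution lying within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(pct_within(k), 2))   # 1 -> 68.27, 2 -> 95.45, 3 -> 99.73
```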


Another way to think about this is to realize that in this distribution, if you have a score that’s within one standard deviation of the mean, i.e., between 80 and 120, that’s pretty average—two thirds of the people are concentrated in that range. But if you have a score that’s two or three standard deviations away from the mean, that is clearly a deviant score, i.e., very high or very low. Only a small percent of the cases lie that far out from the mean.

This is valuable to understand in its own right, and will become useful when we take up determining the significance of difference between means—which we’re going to do next in Lesson 3.

Figure 6. Normal Curve Showing the Percent of Cases Lying Within 1, 2, and 3 Standard Deviations From the Mean


Lesson 3

Testing the Difference Between Means: The t-Test

This is one of the most important parts of this course in basic statistics. Here we’re going to learn about testing the significance of difference between means. What does that mean?

Suppose you’re the superintendent, and one of your principals bursts into your office enthusiastically and says, "I know you’ll be happy to learn that after our big effort this year in reading, my third graders improved from 187 to 195 on the state reading test!"

You immediately ask her, "Is the 8-point difference between those means statistically significant?" When her eyes glaze over and she says, "Huh?" you smile forbearingly (because you've taken this course in basic statistics, and she hasn't), and you patiently explain to her that simply because there is a numerical difference between last year's and this year's mean scores doesn't mean that there is a real difference. It could be due to chance variation in the scores.

So how do we know when the difference between two means is probably a real difference, not one due to chance? We have to say "probably" because nothing in statistics is absolutely certain (as is the case with most things in life). But there are statistical tests that can tell us how likely it is that a difference between two means is due to chance.


One of the most widely used statistical methods for testing the difference between means, and the one we’re going to get you up-to-speed on, is called the t-test.

Let’s go back to the salary data we worked with in Table 1 of Lesson 1, but now let’s compare the mean salary of that group with another group, and ask whether the mean salaries of the two groups are significantly different.

First, let’s look at the formula for the t-test, and determine what we need to make the computation:

Where:

X̄1 is the mean for Group 1.

X̄2 is the mean for Group 2.

n1 is the number of people in Group 1.

n2 is the number of people in Group 2.

s1² is the variance for Group 1.

s2² is the variance for Group 2.

The only thing in this formula you're not familiar with is the symbol s², which stands for the variance. The variance is the same as the standard deviation without the square root, i.e., it's nothing more than the sum of the squared deviations of all the scores from the mean divided by n − 1.
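In code, the variance is just a few lines; this minimal sketch reproduces the Group 1 value used below (150) by rebuilding the 36 salaries from the Table 8 frequencies (the variable names are just illustrative):

```python
def variance(scores):
    """Sample variance: the sum of squared deviations from the mean, divided by n - 1."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

# Group 1 (Table 8) has the same salary distribution as Table 1, in $K.
salaries = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
freqs    = [ 1,  2,  3,  4,  5,  6,  5,  4,  3,  2,  1]
group1 = [x for x, f in zip(salaries, freqs) for _ in range(f)]
print(variance(group1))   # 5,250 / 35 = 150.0
```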

The formula above is for testing the significance of difference between two independent samples, i.e., groups of different people. If we wanted to test the difference between, say, the pre-test and post-test means of the same group of people, we would use a different formula for dependent samples. That formula is:


Where:

ΣD is the sum of all the individuals' pre-post score differences.

ΣD² is the sum of all the individuals' pre-post score differences squared.

n is the number of paired observations.

But for now, we’ll test the significance of difference between the mean salary of two different groups. You can try the one for dependent samples on your own. (I knew you’d welcome that opportunity.)

Tables 8 and 9 provide the numbers we need to compute the t-test for the difference in mean salaries of the two groups.

Table 8. Salaries and t-Test Calculation Data for Group 1

A: Salary (X)   B: Frequency (f)   C: X–Mean (d)   D: fd    E: fd²
20              1                  25              25       625
25              2                  20              40       800
30              3                  15              45       675
35              4                  10              40       400
40              5                  5               25       125
45              6                  0               0        0
50              5                  -5              -25      125
55              4                  -10             -40      400
60              3                  -15             -45      675
65              2                  -20             -40      800
70              1                  -25             -25      625
Sum             36                                          5,250

The variance (s²) = 5,250 / 35 = 150.


Table 9. Salaries and t-Test Calculation Data for Group 2

A: Salary (X)   B: Frequency (f)   C: X–Mean (d)   D: fd    E: fd²
20              0                  27              0        0
25              2                  22              44       968
30              3                  17              51       867
35              3                  12              36       432
40              4                  7               28       196
45              6                  2               12       24
50              6                  -3              -18      54
55              5                  -8              -40      320
60              3                  -13             -39      507
65              2                  -18             -36      648
70              2                  -23             -46      1,058
Sum             36                                          5,074

The variance (s²) = 5,074 / 35 ≈ 145.

You can see from a quick inspection of the two tables that the salary distributions are similar. There are a few more people making higher salaries in the second group. The mean of the second group (which has been calculated for you) is slightly higher (47 vs. 45 for the first group), and the variance is smaller (145 vs. 150). So let's plug the numbers into the t-test formula and see what we get.

We now know that t = .222. So what does that mean? Is the difference between the two means statistically significant or not? To find out whether a t value of any size is significant or not, we simply look it up in a table that can be found in the appendices of any statistics textbook. The quick answer in this case is no, it is not statistically significant. That is, the 2-point difference in the mean salaries of these two groups could likely have occurred by chance.

But that’s the quick and dirty answer. There’s more about the matter of statistical significance we need to understand. So we’re

Page 20: Statistics Basics

going to that important topic now, and we’ll return to this example after we’ve done that.

Lesson 4

Statistical Significance and the Type I and Type II Errors

Certainty and Uncertainty—Universes and Samples

Why do we have to use statistical tests, anyway? When we have two groups with different means, why can’t we just say that one is higher than the other, and that’s it? The reason is that the difference between the means of the two groups may be due to chance, and if we were to make the comparison again, the difference might be turned around.

How can that be? The two main reasons are sampling and measurement error. The particular sample we have may not be representative of the universe from which it is drawn. Also, tests and measuring instruments are not perfect.

For example, suppose that within the next hour we could somehow magically measure the height of every adult man and woman in the world, and we found that the mean height of the men was 5’6", and the mean for the women was 5’3". Since we have measured the entire universe of adult men and women, those are the averages, not estimates of them based on samples. We don’t need to run a t-test to see if the 3" difference between the means is statistically significant. That is the difference.

But if—as is almost always the case in whatever we do—we have to use a sample, we have to account for the fact that the sample, no matter how carefully drawn, may not be representative of the universe. Usually it is, but sometimes it’s not.

A good way to understand this important point is to realize that if we were to take 100 random samples of 1,000 people each, the means of those samples would form a normal curve (just like the ones we worked on in Lesson 2). In other words, the means of some of those samples would be as much as (or more than) 3 standard deviations on either side of the collective mean of the 100,000 people.
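You can watch this happen with a small simulation; here is a rough Python sketch (the population values are invented just to illustrate the idea):

```python
import random
import statistics

random.seed(1)
# A made-up "universe" of 100,000 values with a true mean of about 100.
universe = [random.gauss(100, 20) for _ in range(100_000)]

# Draw 100 random samples of 1,000 each and record each sample's mean.
sample_means = [statistics.mean(random.sample(universe, 1000)) for _ in range(100)]

print(round(statistics.mean(universe), 1))                        # the true mean, about 100
print(round(min(sample_means), 1), round(max(sample_means), 1))   # most sample means fall close to it
print(round(statistics.stdev(sample_means), 2))                   # the spread of the sample means is small
```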

When we take just one sample, which is what we usually have to work with, the chances are it’s close to the real mean, simply because most of the values are clustered close to the mean (remember, 68% of the values are within ± 1 standard deviation from the mean). But we can’t be sure. The sample we’re working with just might be one of those that’s lying out at the extremes of the normal curve.

That’s why we have tests of statistical significance. They can’t tell us for sure whether the means we’re comparing are close to the true mean, but they can give us a good estimate or probability of whether that’s the case.

Scientific Knowledge and the Null Hypothesis

As you’ve probably realized by now, scientists and statisticians understand that error and uncertainty are inevitable, but they’re very uncomfortable with it. Thus, one of the basic tenets of science, which is reflected in statistics, is the requirement that nothing be admitted into the body of scientific knowledge unless we’re as sure as we can be that it’s true. In other words, there is a strong

Page 22: Statistics Basics

conservative bias in science and statistics. Scientists would rather be guilty of waiting until there’s more evidence to be sure than to accept a finding prematurely and be wrong. In statistics, this takes the form of what is called the "null hypothesis." Basically, the null hypothesis says that whenever you are, for example, setting out to compare the difference between two means, you begin with the assumption—indeed, the assertion—that there is no difference between the means. And in order to conclude that there is a difference, your task is to disprove the null hypothesis.

Levels of Significance

Now this leads to a very difficult decision. And to understand the difficulty, let's first go back to the t-test of the two means we ran in Lesson 3. We found that, for that test, t = .222. In order to find out if the difference between the means is statistically significant (i.e., how likely it is that it is due to chance), we look up the value of t in one of the statistical significance tables that are found in the appendices of all statistics texts. The t-test table we need is reproduced below.

Table 10. t-Test Values Required to Reject the Null Hypothesis at the .05 and .01 Levels of Confidence (Two-Tailed Test)

Degrees of Freedom (df)   .05    .01

20 2.09 2.85

21 2.08 2.83

22 2.07 2.82

23 2.07 2.81

24 2.06 2.80

25 2.06 2.79

26 2.06 2.78

27 2.05 2.77


28 2.05 2.76

29 2.05 2.76

30 2.04 2.75

35 2.03 2.72

40 2.02 2.71

45 2.01 2.70

50 2.01 2.68

55 2.00 2.67

60 2.00 2.66

65 2.00 2.66

70 2.00 2.65

75 1.99 2.64

80 1.99 2.64

85 1.99 2.64

90 1.99 2.63

95 1.99 2.63

100 1.98 2.63

Infinity 1.96 2.58


In order to use this table, we enter it with our t value (.222) and something called "degrees of freedom." The degrees of freedom is simply n1 – 1 + n2 – 1 or, in our case, 70. Note that there are two columns of t values, one labeled .05, and the other labeled .01. If we go down to the degrees of freedom nearest to ours, which would be 70, we find that both the .05 and the .01 t values are substantially larger than our .222. So we didn't achieve a large enough t value to reject the null hypothesis, i.e., to be able to conclude that the difference wasn't due to chance.

Why do we have two columns, one labeled .05 and the other .01? Because those are the two levels of significance commonly used in statistical analysis. A t value as large as those in the .05 column is likely to occur by chance only 5 percent of the time when the null hypothesis is true, whereas a t value as large as those in the .01 column is likely to occur by chance only 1 percent of the time.
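If you don't have a printed table handy, a statistics library will give the same critical values (and the exact probability); a minimal sketch using SciPy:

```python
from scipy import stats

df = 70
print(round(stats.t.ppf(1 - 0.05 / 2, df), 2))   # 1.99 -- the .05 two-tailed critical value (Table 10 rounds it to 2.00)
print(round(stats.t.ppf(1 - 0.01 / 2, df), 2))   # 2.65 -- the .01 two-tailed critical value

# The probability of a t value at least as extreme as ours (t = .222) arising by chance:
print(round(2 * stats.t.sf(0.222, df), 2))       # roughly .8 -- nowhere near significance
```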

Type I and Type II Errors

The choice of what significance level to use (.05, .01, or lower or higher) is the difficult choice that you as the researcher must make. If you decide to accept the .05 level of confidence, which requires a smaller t value, you can more easily reject the null hypothesis and declare that there is a statistically significant difference between the means than if you select the .01 level; but when the null hypothesis is actually true, you will wrongly reject it 5 percent of the time. This is the Type I error.

On the other hand, if you select the .01 level, you will make that mistake only 1 percent of the time. But since the .01 level requires a larger t value, you will less often be able to reject the null hypothesis and say that there is a statistically significant difference between the means when in fact there is one. This is the Type II error. It is crucially important to an understanding of even basic statistics that we have a clear understanding of these two errors. If you spend a little time with Table 11, it will help you achieve this understanding.

Table 11. Accepting and Rejecting Null Hypotheses and the Making of Type I and Type II Errors*

A. The null hypothesis is really true, i.e., there is not a real difference between the means of the two groups.

1. You accept the null hypothesis: you concluded that there is not a real difference between the means of the two groups, which, in fact, is the case. That was a good decision.

2. You reject the null hypothesis: you concluded that there is a real difference between the means of the two groups when, in fact, there is not a difference. That was a bad decision. You made the Type I error.

B. The null hypothesis is really false, i.e., there is a real difference between the means of the two groups.

3. You accept the null hypothesis: you concluded that there is not a real difference between the means of the two groups when in fact there is a real difference. That was a bad decision. You made the Type II error.

4. You reject the null hypothesis: you concluded that there is a real difference between the means of the two groups, which, in fact, is the case. That was a good decision.

*This table was adapted from a similar one found in Neil Salkind’s Statistics for People Who (Think They) Hate Statistics, Sage Publications, 2000, p. 176.

Table 11 and the work we've done in this lesson make the mysteries of statistical significance and Type I and Type II errors transparently clear. When you're reading a professional journal and you encounter a discussion of the difference between the means of two groups, and the authors conclude by saying t = 2.64, p < .01, df = 70, two-tailed test, you will immediately know that:

1. The t-test of the two means yielded a t value of 2.64.

2. A t value of 2.64 with df = 70 for independent samples (the two-tailed rather than the one-tailed test) is statistically significant beyond the .01 level of confidence, i.e., likely to occur by chance less than 1 in 100 times.


So, this knowledge is a major step forward in your journey to master basic statistics. And we’ve got a few more neat things to cover.

Lesson 5

The Effect Test

Take another look at Table 10 in Lesson 4, which provides the significance levels for the t-test. You probably noticed that the size of the t value needed to reject the null hypothesis (and enable you to declare that there is a statistically significant difference between two means) depends on the size of the samples on which the means are based. With df = 20, you need a t value of 2.09 to reach the .05 level of significance; but with df = 100, you need a t value of only 1.98.

In other words, if you have small Ns, you will need a large difference between the means to achieve statistical significance; but if you have very large Ns, you will need only a very small difference to be able to declare that the difference between the means is statistically significant.

So why is that of more than technical interest? Because we don’t want to mistake statistical significance for educational significance. Suppose you are comparing the mean reading scores of students in your traditional program with those in a new reading program. There are 500 students in each program, and at the end of the year there is a 3 point difference favoring the new program, and that 3 point difference is statistically significant beyond the .001 level of confidence. The proponents of the new program are likely to cite that finding as clear research evidence of the superiority of the new program and call on you, as the superintendent, to junk the traditional program, even though the new program is substantially more costly.

But you should be wary of that recommendation. Why? Because the mean difference in reading scores, even though it’s statistically significant, is very small. Is a 3 point difference likely to have any practical significance, or even be observable? Probably not. Even if the difference were a few points greater, would such a difference justify the expenditure of substantially more funds? Probably not.


It turns out that statisticians have developed a test that is intended to give some help when confronting the question of whether a difference between two means is of practical consequence. It’s called the Effect Test.

The formula for the Effect Test is:

E = (X̄1 − X̄2) / ((s1 + s2) / 2)

Where:

E is the effect size.

X̄1 is the mean of Group 1.

X̄2 is the mean of Group 2.

s1 is the standard deviation of Group 1.

s2 is the standard deviation of Group 2.

As you can see, the formula simply divides the difference between the two means by the average of the two standard deviations, i.e., the average of the score variability in the two groups.

There is a general consensus that an effect size of .33 or greater indicates that the difference has practical meaning or significance.

Let’s do an example.

We have two groups with mean reading test scores of 188 and 185 and standard deviations of 30 and 32. N = 500 for both groups, and the difference between the means is statistically significant. We plug the numbers into the Effect Test formula as follows:

E = (188 − 185) / ((30 + 32) / 2) = 3 / 31 ≈ .097

The effect size does not reach the .33 level, so the 3 point difference between the means would not be regarded as practically consequential, even though it’s statistically significant.


But suppose the two means are 193 and 182, and the Ns and standard deviations are the same. Then we have:

E = (193 − 182) / ((30 + 32) / 2) = 11 / 31 ≈ .35

The 11 point difference between the means (with the associated score variability as reflected in the standard deviations) exceeds the .33 threshold for practical significance. So in this case, we would be justified in saying that the difference between the two groups is not only statistically significant, it can also be regarded as having some practical educational meaning.
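Both of these examples are easy to reproduce; here is a minimal Python sketch of the averaging-of-standard-deviations calculation described above:

```python
def effect_size(mean1, mean2, sd1, sd2):
    """Difference between the means divided by the average of the two standard deviations."""
    return (mean1 - mean2) / ((sd1 + sd2) / 2)

print(round(effect_size(188, 185, 30, 32), 3))   # 0.097 -- below the .33 threshold
print(round(effect_size(193, 182, 30, 32), 3))   # 0.355 -- above the .33 threshold
```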

However, in the final analysis, you, as the experienced educator and administrator, must make the judgment about practical meaning. Many times you will be presented with mean differences that are large by any practical standard, but because of small Ns or large variances, they’re not statistically significant. In those cases, the judgment is fairly easy: You would be on very soft ground making policy and budgetary decisions based on differences that are not statistically significant.

But the other case is more difficult. If you have a mean difference that is both statistically significant and practically significant as indicated by the effect size, you still have to be the judge of whether that difference justifies changing programs, spending more money, hiring or firing staff, and so on.

The new knowledge you now have about how to determine statistical and practical significance adds greatly to your ability to make decisions about the effectiveness of educational programs and the formulation of educational policies, but there are no automatic answers. You, as the responsible administrator, must bring your experience to bear in making the final decision.


Lesson 6

Correlation

What is a Correlation?

Thus far we’ve covered the key descriptive statistics—the mean, median, mode, and standard deviation—and we’ve learned how to test the difference between means. But often we want to know how two things (usually called "variables" because they vary from high to low) are related to each other.

For example, we might want to know whether reading scores are related to math scores, i.e., whether students who have high reading scores also have high math scores, and vice versa. The statistical technique for determining the degree to which two variables are related (i.e., the degree to which they co-vary) is, not surprisingly, called correlation.

There are several different types of correlation, and we’ll talk about them later, but in this lesson we’re going to spend most of the time on the most commonly used type of correlation: the Pearson Product Moment Correlation. This correlation, signified by the symbol r, ranges from –1.00 to +1.00. A correlation of 1.00, whether it’s positive or negative, is a perfect correlation. It means that as scores on one of the two variables increase or decrease, the scores on the other variable increase or decrease by the same magnitude—something you’ll probably never see in the real world. A correlation of 0 means there’s no relationship between the two variables, i.e., when scores on one of the variables go up, scores on the other variable may go up, down, or whatever. You’ll see a lot of those.

Thus, a correlation of .8 or .9 is regarded as a high correlation, i.e., there is a very close relationship between scores on one of the variables and scores on the other. And correlations of .2 or .3 are regarded as low correlations, i.e., there is some relationship between the two variables, but it's a weak one. Knowing people's score on one variable wouldn't allow you to predict their score on the other variable very well.

Computing the Pearson Product Moment Correlation

Let’s do a correlation to see how the formula works and what it produces. The formula for the Pearson product moment correlation is:

Where:

rxy is the correlation coefficient between X and Y.

n is the size of the sample.

X is the individual’s score on the X variable.

Y is the individual’s score on the Y variable.


XY is the product of each X score times its corresponding Y score.

X² is the individual X score squared.

Y² is the individual Y score squared.

Let’s see what the correlation is between 30 students’ reading scores and their math scores. The data we need to compute the formula are given in Table 12.

Table 12. Reading and Math Scores and the Associated Data for Computing the Pearson Product Moment Correlation (N=30)

X (Reading)   Y (Math)   X²   Y²   XY

191 180 36481 32400 34380

103 101 10609 10201 10403

187 173 34969 29929 32351

108 103 11664 10609 11124

180 170 32400 28900 30600

118 113 13924 12769 13334

178 171 31684 29241 30438

127 122 16129 14884 15494

176 168 30976 28224 29568

134 130 17956 16900 17420

165 150 27225 22500 24750

147 145 21609 21025 21315

160 150 25600 22500 24000

157 154 24649 23716 24178

155 145 24025 21025 22475


168 164 28224 26896 27552

150 145 22500 21025 21750

172 170 29584 28900 29240

145 130 21025 16900 18850

185 179 34225 32041 33115

140 141 19600 19881 19740

195 193 38025 37249 37635

135 136 18225 18496 18360

100 101 10000 10201 10100

130 128 16900 16384 16640

125 121 15625 14641 15125

105 106 11025 11236 11130

120 118 14400 13924 14160

115 112 13225 12544 12880

110 108 12100 11664 11880

Sum   4381   4227   664583   616805   639987


So, we plug the numbers from this table into the formula, do the math, and get r = .98.


In this case, the correlation between reading and math scores is remarkably high (because I concocted the numbers so it would turn out that way). With real scores, it would be high, but not that high. If you glance over the numbers in Table 12, even before we’ve computed the correlation you can easily see (in this small sample of 30) that high scores in reading tend to go with high scores in math, low reading scores tend to go with low math scores, and so on. But, of course, you wouldn’t be able to see that pattern if you had a sample of 500.
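Here is a minimal Python sketch of the raw-score formula; to keep it short it uses only the first six pairs of scores from Table 12, which already show the same strong positive relationship:

```python
import math

def pearson_r(x, y):
    """r = (nΣXY - ΣXΣY) / sqrt((nΣX² - (ΣX)²)(nΣY² - (ΣY)²))"""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

reading = [191, 103, 187, 108, 180, 118]   # first six X values from Table 12
maths   = [180, 101, 173, 103, 170, 113]   # first six Y values from Table 12
print(round(pearson_r(reading, maths), 3))   # about .999 with just these six pairs
```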

Positive and Negative Correlations

I pointed out above that a correlation can vary from +1.00 to –1.00. The correlation we just computed is a positive correlation. That is, high reading scores go with high math scores, low with low, and so on. However, we could have a negative correlation. This is not something bad; it simply denotes an association in which high scores on one variable go with low scores on the other. For example, if we were computing a correlation between, say, amount of time students watch television and their achievement score, we would find a negative correlation: high TV watching is associated with lower achievement scores, and vice versa. Such a correlation might be something like –.71.


Determining Statistical Significance

OK, so we have a correlation coefficient. What precisely does it mean, and how do we interpret it? It’s not a percent, as many people mistakenly think.

First, we can determine its statistical significance in the same way we did with the t-test. We can look it up in a table in the appendices of any statistical text. In the case of our .98 correlation between reading and math scores, if we look that up in the table for correlations, we find that the value needed to reject the null hypothesis at the .01 level of confidence (and declare that the correlation is statistically significant, or unlikely to be due to chance) for our sample of 30 is .45 (in this case using the one-tailed test because the samples are dependent).

So if we were stating this finding in a research report, we could say that for the correlation of reading scores with math scores, r = .98, p < .01, df = 28. (Now see how smart you are because you know what all that means.)

Practical vs. Statistical Significance

But we have the same issue we had with the t-test: determining its practical vs. its statistical significance. We don’t have an effect test, as we did with the t-test, but we have something similar. It has an imposing name—the coefficient of determination—but you’ll be ecstatically happy to learn that it’s very simple.

The coefficient of determination is nothing more than r². You simply multiply r by itself, and you've got it. OK, you've got it, what does it mean? The coefficient of determination, r², tells us how much of the variance in one of the variables is accounted for by the variance in the other variable. Thus, if we have a correlation of .60 between, say, students' achievement scores and a measure of their socioeconomic status, r² = .36. That means that 36% of the variance in the students' achievement scores (not 60%, which is the correlation expressed as a percentage) can be accounted for by variance in their socioeconomic status. But that also means that the remaining variance (64%) in achievement scores cannot be accounted for by socioeconomic status, but is attributable to many other factors, such as study time, intelligence, motivation, quality of instruction, and so on.

Other Correlations


All the correlations we’ve talked about so far have been based on what we call interval data, i.e., data where the distance between scores or values is the same. The distance between a 65 and 66 is assumed to be the same as the distance between a 14 and a 15. But many times we want to determine the relationship between two variables when that is not the case. Suppose, for example, we want to compute the correlation between students’ class rank in their junior year with their class rank in their senior year. Ranks are not the same as scores; there may be a much smaller (or bigger) difference between ranks 1 and 2 than between ranks 8 and 10 (like the difference between the first two teams and the last two teams in football or baseball). If the data we have are ranks rather than scores, we can’t use the product moment formula. But there is another correlation formula for use with ranks (it’s called rho).

And suppose we want to determine the relationship between two variables when one is based on what is called nominal or categorical data, and the other is interval data. An example would be correlating gender with achievement scores. Again, the product moment correlation can’t be used, but there is also a special formula for doing a correlation with these disparate types of data. In this case, it’s called the point biserial correlation.

Table 13 displays the several different types of correlation for use with variables based on different levels of measurement. In this course, we’re not going to compute them. But with the knowledge and skills you’ve developed thus far, when you encounter situations where the variables you want to correlate are based on different levels of measurement (interval, ordinal, or nominal), you’ll be able to select the type you need.

Table 13. Alternative Types of Correlation for Different Levels of Measurement*

Variable X: Interval (reading scores)
Variable Y: Interval (math scores)
Correlation being computed: between reading and math achievement
Type of correlation: Pearson product moment (r)

Variable X: Ordinal (class rank in the junior year)
Variable Y: Ordinal (class rank in the senior year)
Correlation being computed: between class rank in the last two years of high school
Type of correlation: Spearman rank coefficient (rho, ρ)

Variable X: Nominal (social class: high, middle, or low)
Variable Y: Ordinal (rank in high school graduating class)
Correlation being computed: between social class and rank in high school
Type of correlation: Rank biserial coefficient (rbs)

Variable X: Nominal (family configuration, e.g., intact or single parent)
Variable Y: Interval (grade point average)
Correlation being computed: between family configuration and grade point average
Type of correlation: Point biserial (rpb)

Variable X: Nominal (voting preference, Republican or Democrat)
Variable Y: Nominal (gender, i.e., male or female)
Correlation being computed: between voting preference and gender
Type of correlation: Phi coefficient (φ)

*This table was adapted from a similar one found in Neil Salkind’s Statistics for People Who (Think They) Hate Statistics, Sage Publications, 2000, p. 101.

Correlation and Cause

Before we conclude this lesson, we need to understand one of the most important facts about correlation, namely, that it does not necessarily indicate cause. It may be that one of the variables does in fact cause the other, but we don’t know that just from the fact that the two are correlated.

Smoking and Lung Cancer

It is now an established fact that smoking causes lung cancer, but that conclusion could not be reached simply because there is a correlation between the two. When the association between smoking and lung cancer first appeared, and many argued that it indicated that smoking caused lung cancer, the tobacco companies argued that there were other factors that could explain the relationship, e.g., smoking is higher among blue-collar workers who also have greater exposure to other toxic elements, smokers drink more and lead more stressful lives, and so on. And logically they were right. It took other kinds of direct physiological evidence and animal experiments to prove that the association was indeed causal.

We often find strong correlations where clearly a causal relationship makes no sense. For example, we may find a strong correlation between car sales and college attendance. Neither one of these is causing the other; both increase during financially prosperous times.

Wine Consumption and Heart Disease

But it is when two correlated variables seem likely to be causally related to one another that we tend to jump to the unsupportable conclusion that one causes the other. For example, when we hear about a correlation between an increase in stork nests and the birth rate in Germany, we laugh it off as clearly due to some unknown third factor. But when we hear that moderate wine consumption is associated with lower rates of heart disease, we’re ready to immediately conclude (especially if we’re wine lovers) that there is obviously some medically beneficial element in wine. But when these reports first came out, skeptics (they were probably statisticians) pointed out that other things could account for the association between moderate wine consumption and lower rates of heart disease. Moderate wine drinkers are likely to be more educated, non-smokers, get more exercise, and have lower rates of obesity. Again, as it has turned out, other kinds of physiological evidence do support the conclusion that moderate wine consumption is medically beneficial, but we can’t conclude that just on the basis of the correlation.

The Important Lesson About Correlation and Cause

The important lesson here is that the correlation coefficient is a highly useful statistic for determining the relationship between variables, but a correlation does not demonstrate a causal relationship between the variables.

The same holds for differences between means. If, for example, we give a pre-test and a post-test to students who have participated in a new reading program, and we find that the increase in the mean reading score is both statistically and practically significant, that does not entitle us to conclude that the new program caused the increase. Any number of other factors could account for the increase: the students were older, and they had been exposed to many other influences and experiences that could have—and probably did—improve their reading. To determine how much, if any, of the improvement was caused by the new program, we would have to employ a control group (or some other method for determining "the expectation of non-treatment"). This would tell us how much improvement occurred in comparable students who had the same experiences except for the new reading program.

Lesson 7

Chi Square

Parametric and Non-Parametric Statistics

Most of the statistics we’ve learned so far—the mean, the standard deviation, the t-test, and the product moment correlation—belong to a category called parametric statistics. That’s because it is assumed the data used to compute them have certain parameters or meet certain conditions. One of these is that the variances are similar; another is that the sample is large enough to be representative of the universe from which it is drawn. We used examples of 30 or more cases when we worked on the mean, the t-test, and the product moment correlation because there is a general consensus among statisticians that this is the minimum-size sample to use with parametric tests. You should keep this in mind when using these tests in your practicum and in your own research.


But what do we do when we can’t meet these conditions? Happily, there’s another category of statistics, and you shouldn’t be surprised to learn that it’s called non-parametric statistics. We can do many of the same things with non-parametric statistics. They’re regarded as somewhat less powerful than parametric statistics, but they’re not to be looked down on. When conditions call for them, they are the things to use.

Chi Square

One of the most useful of the non-parametric statistics is chi square. We use it when our data consist of people distributed across categories, and we want to know whether that distribution is different from what we would expect by chance (or another set of expectations). We don’t have scores, we don’t have means. We just have numbers, or frequencies. In other words, we have nominal data.

For example, suppose we have the data in Table 14, which display the number of students who elect different majors, and we want to know whether those numbers differ from what we would expect by chance. In other words, are some majors selected more often than others, or is the selection pattern essentially random?

Table 14

Number of Students Selecting Different Majors

Major                 Number of Students
Pre-Med                       50
Computer Sciences             85
English Literature            25
Education                     60
Engineering                   80
Total                        300

The null hypothesis here, of course, is that there is no difference between this distribution of major selections and what would be expected by chance. So what chi square does is compare these numbers (the observed frequencies) with those that would be expected by chance (the expected frequencies).

The formula for chi square is:

χ² = Σ (O – E)² / E

Where:

χ² is the value for chi square.

Σ is the symbol for summation.

O is the observed frequency.

E is the expected frequency.

The first question in doing the calculation is, how do we get the expected frequencies? That’s easy. If we are testing the observed frequencies (those in Table 14) against what we would expect by chance, since we have five categories of majors, we would expect one-fifth of the individuals to fall in each of the categories. One-fifth (20%) of 300 is 60. So if the selection of majors is largely a chance pattern, we would expect to find 60 people in each category.

Table 15 displays the observed and expected frequencies for each major, computes the difference between them (O–E), squares that difference ((O–E)²), divides each squared difference by the expected frequency ((O–E)²/E), and sums those quantities to give us our χ², which is 39.17.

Table 15

Observed and Expected Frequencies for the Selection of Majors

Major                 O (observed)   E (expected)    O–E    (O–E)²   (O–E)²/E
Pre-Med                    50             60         -10      100       1.67
Computer Sciences          85             60          25      625      10.42
English Literature         25             60         -35     1225      20.42
Education                  60             60           0        0       0.00
Engineering                80             60          20      400       6.67
Total                     300            300                           39.17
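If you would like to check this arithmetic yourself, here is a minimal Python sketch (not part of the original lesson) that reproduces the calculations in Table 15. The majors and counts come from Table 14; everything else is plain Python.

# Observed frequencies from Table 14
observed = {
    "Pre-Med": 50,
    "Computer Sciences": 85,
    "English Literature": 25,
    "Education": 60,
    "Engineering": 80,
}

total = sum(observed.values())      # 300 students in all
expected = total / len(observed)    # 300 / 5 = 60 expected per major under chance

chi_square = 0.0
for major, o in observed.items():
    contribution = (o - expected) ** 2 / expected
    chi_square += contribution
    print(f"{major:20s} O = {o:3d}  (O-E)^2/E = {contribution:.2f}")

print(f"Chi square = {chi_square:.2f}")   # prints 39.17, matching Table 15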


By now, you know the next step: determining if we can reject the null hypothesis. We do it the same way we did for the t-test and the correlation. We enter the chi square significance table (which I have handy, but you don’t) with our chi square value (39.17) and the appropriate degrees of freedom. For chi square, the degrees of freedom are equal to the number of rows minus one (R–1). In our case we have five rows, so df = 4.

Entering the chi square table with our result of 39.17 and df = 4, we find that we need a chi square value of 13.28 to reject the null hypothesis at the .01 level of confidence. We clearly have that, so we can say that the distribution of major selections is not simply a chance pattern; or χ² = 39.17, p < .01, df = 4.
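If you have Python and its SciPy library available, you don't actually need a printed significance table. Here is a minimal sketch (not part of the original lesson; it assumes SciPy is installed) that checks our result against both the critical value and the exact probability:

from scipy import stats

chi_square = 39.17
df = 4

# Critical value at the .01 level of confidence -- about 13.28, as in the printed table
critical_value = stats.chi2.ppf(0.99, df)

# Probability of getting a chi square this large if the null hypothesis were true
p_value = stats.chi2.sf(chi_square, df)

print(f"Critical value at .01: {critical_value:.2f}")
print(f"p value: {p_value:.2g}")   # far below .01, so we reject the null hypothesis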

Lesson 8

Summarizing the Steps and Moving On

In the statistical tests we’ve calculated (the t-test, the correlation, and χ²), we’ve gone through a series of steps that you’ll go through when you compute any statistical test.


Recapping, here they are:

1. First, determine the level of measurement you have. Are the data you have interval, ordinal, or nominal?

2. If you have interval data, determine whether they meet the requirements of a parametric test (adequate sample size and variance similarity).

3. Based on the determinations you made in (1) and (2), select the statistical test (t, r, χ², or whatever).

4. Calculate the values required, plug them into the formula, and compute the test. (Now that you have gone through these calculations and understand them, the labor can be done for you by any one of the available statistical software packages; a short example appears right after this list.)

5. Select the level of risk you want to take in rejecting the null hypothesis and making (or avoiding) the Type I and Type II errors. Usually that will be .05 or .01.

6. Enter the appropriate significance table (e.g., for t, r, or χ²) with the test result and the proper degrees of freedom.

7. Determine whether your test result is large enough to reject the null hypothesis and enable you to conclude that it is statistically significant.

8. If it is statistically significant, use whatever additional tests may be available (e.g., the effect size, the coefficient of determination, etc.) and your own reasoned judgment to determine whether the result is also practically significant.
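To make steps 4 through 7 concrete, here is a brief Python sketch using the SciPy library. The reading-score data are invented purely for illustration and are not from the lesson:

from scipy import stats

# Hypothetical post-test reading scores for a program group and a control group
program_group = [78, 85, 90, 72, 88, 95, 81, 79, 92, 86]
control_group = [70, 75, 80, 68, 74, 82, 77, 73, 79, 76]

# Step 4: let the software compute the t-test
t_statistic, p_value = stats.ttest_ind(program_group, control_group)

# Steps 5-7: compare the p value with the level of risk we chose (.05 here)
alpha = 0.05
print(f"t = {t_statistic:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Statistically significant: reject the null hypothesis.")
else:
    print("Not statistically significant: do not reject the null hypothesis.")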

*    *    *    *

Congratulate yourself. The fact that you understand these steps and can execute them shows how far you’ve come. You now have a good grip on basic statistics. You can understand them in research journals, and you can use them in your practicum and in your own research. And you are now in a position to go on to more advanced statistics (I know you can’t wait).

References

I have not provided a set of references because there are literally dozens of introductory statistics texts, and just about any of them will do. You definitely should have one of these texts for reference purposes, especially for the significance tables they all provide. My favorite, and the one I highly recommend, is Neil Salkind's Statistics for People Who (Think They) Hate Statistics (Sage Publications, 2000).

Statistical Software


This short course has taken you through both the explanation of the major statistical concepts and the actual computation of the most common statistical tests you will encounter in the research literature and use in your own research.

Now that you have this essential, basic understanding, you won’t need to do any computations by hand. There are software applications that will do that for you. Once you enter the data, they will compute a correlation in less than a second, and provide you with the significance levels.
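For example, here is a minimal Python sketch using the SciPy library (the paired numbers are invented for illustration; any statistical package offers the same kind of one-line call):

from scipy import stats

# Hypothetical paired data: hours studied and exam scores for ten students
hours = [2, 4, 5, 7, 8, 10, 11, 13, 15, 16]
scores = [52, 58, 60, 65, 70, 74, 77, 83, 88, 90]

# One call computes the product moment correlation and its significance level
r, p_value = stats.pearsonr(hours, scores)

print(f"r = {r:.3f}, p = {p_value:.6f}")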