One-Way Independent Samples Analysis of Variance

If we are interested in the relationship between a categorical IV and a continuous DV, the one-way independent samples analysis of variance (ANOVA) may be a suitable inferential technique. If the IV had only two levels (groups), we could just as well do a t test, but the ANOVA allows us to have 2 or more categories. The null hypothesis tested is that μ1 = μ2 = ... = μk, that is, all k treatment groups have identical population means on the DV. The alternative hypothesis is that at least two of the population means differ from one another.

We start out by making two assumptions:
- Each of the k populations is normally distributed, and
- Homogeneity of variance - each of the populations has the same variance; the IV does not affect the variance in the DV.

Thus, if the populations differ from one another they differ in location (central tendency, the mean).

The model we employ here states that each score on the DV has two components:
- the effect of the treatment (the IV, Groups), and
- error, which is anything else that affects the DV scores, such as individual differences among subjects, errors in measurement, and other extraneous variables.

That is, Yij = μ + τj + eij, or, Yij - μ = τj + eij. The difference between the DV score of subject number i in group number j, Yij, and the grand mean (μ) is equal to the effect of being in treatment group number j, τj, plus error, eij. [Note that I am using i as the subscript for subject # and j for group #.]

Computing ANOVA Statistics From Group Means and Variances, Equal n

Let us work with the following contrived data set. We have randomly assigned five students to each of four treatment groups, A, B, C, and D. Each group receives a different type of instruction in the logic of ANOVA. After instruction, each student is given a 10-item multiple-choice test. Test scores (# items correct) follow:

Group   Scores        Mean
A       1 2 2 2 3      2
B       2 3 3 3 4      3
C       6 7 7 7 8      7
D       7 8 8 8 9      8

Now, do these four samples differ enough from each other to reject the null hypothesis that type of instruction has no effect on mean test performance? First, we use the sample data to estimate the amount of error variance in the scores in the population from which the samples were randomly drawn. That is, variance (differences among scores) that is due to anything other than the IV. One simple way to do this, assuming that you have an equal number of scores in each sample, is to compute the average within-group variance,

MSE = (s1² + s2² + ... + sk²) / k,

where sj² is the sample variance in group number j.

Thought exercise: Randomly choose any two scores that are in the same group. If they differ from each other, why? Is the difference because they got different treatments? Of course not; all subjects in the same group got the same treatment. It must be other things that caused them to have different scores. Those other things, collectively, are referred to as error.

MSE is the mean square error (aka mean square within groups): mean because we divided by k, the number of groups; square because we are working with variances; and error because we are estimating variance due to things other than the IV. For our sample variances the MSE = (.5 + .5 + .5 + .5) / 4 = 0.5. MSE is not the only way to estimate the population error variance.
If we assume that the null hypothesis is true, we can get a second estimate of population error variance that is independent of the first estimate. We do this by finding the sample variance of the k sample means and multiplying by n, where n = number of scores in each group (assuming equal sample sizes). That is,

MSA = n · s²means

I am using MSA to stand for the estimated among-groups or treatment variance for independent variable A. Although you only have one IV now, you should later learn how to do ANOVA with more than one IV. For our sample data we compute the variance of the four sample means, VAR(2, 3, 7, 8) = 26 / 3, and multiply by n, so MSA = 5 · 26 / 3 = 43.33.

Now, our second estimate of error variance, the variance of the means, MSA, assumes that the null hypothesis is true. Our first estimate, MSE, the mean of the variances, made no such assumption. If the null hypothesis is true, these two estimates should be approximately equal to one another. If not, then MSA will estimate not only error variance but also variance due to the IV, and MSA > MSE.

We shall determine whether the difference between MSA and MSE is large enough to reject the null hypothesis by using the F statistic. F is the ratio of two independent variance estimates. We shall compute F = MSA / MSE, which, in terms of estimated variances, is the effect of error and treatment divided by the effect of error alone. If the null hypothesis is true, the treatment has no effect, and F = [error / error] ≈ 1. If the null hypothesis is false, then F = [(error + treatment) / error] > 1. Large values of F cast doubt on the null hypothesis; small values of F do not. For our data, F = 43.33 / .5 = 86.66.

Is this F large enough to reject the null hypothesis, or might it have happened to be this large due to chance? To find the probability of getting an F this large or larger, our exact significance level, p, we must work with the sampling distribution of F. This is the distribution that would be obtained if you repeatedly drew sets of k samples of n scores each, all from identical populations, and computed MSA / MSE for each set. It is a positively skewed sampling distribution with a mean of about one.

Using the F table, we can approximate p. Like t distributions, F distributions have degrees of freedom, but unlike t, F has df for the numerator (MSA) and df for the denominator (MSE). The total df in the k samples is N - 1 (where N = total # scores) because the total variance is computed using sums of squares for N scores about one point, the grand mean. The treatment A df is k - 1 because it is computed using sums of squares for k scores (group means) about one point, the grand mean. The error df is k(n - 1) because MSE is computed using k within-groups sums of squares, each computed on n scores about one point, the group mean. For our data, total df = N - 1 = 20 - 1 = 19. Treatment A df = k - 1 = 4 - 1 = 3. Error df = k(n - 1) = N - k = 20 - 4 = 16. Note that total df = treatment df + error df.

So, what is p? Using the F table in our textbook, we see that there is a 5% probability of getting an F(3, 16) >= 3.24. Our F > 3.24, so our p < .05. The table also shows us that the upper 1% of an F distribution on 3, 16 df is at and beyond F = 5.29, so our p < .01. We can reject the null hypothesis even with an a priori alpha criterion of .01.
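As a quick check on this arithmetic, a few lines of SAS will reproduce MSA, MSE, F, and p from the group means and variances. This is only an illustrative sketch; the data set and variable names below are arbitrary choices of mine, not part of the lesson's own programs.

data hand_anova;
   n = 5;  k = 4;
   msa = n * var(2, 3, 7, 8);         * n times the variance of the four group means = 43.33 ;
   mse = mean(0.5, 0.5, 0.5, 0.5);    * the average within-group variance = 0.50 ;
   F = msa / mse;                     * 86.66 ;
   df_num = k - 1;
   df_den = k * (n - 1);
   p = 1 - probf(F, df_num, df_den);  * upper-tailed p value from the F distribution ;
run;
proc print data=hand_anova;
run;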
Note that we are using a one-tailed test with nondirectional hypotheses, because regardless of the actual ordering of the population means (for example, μ1 > μ2 > μ3 > μ4, or μ1 > μ4 > μ3 > μ2, etc.), any deviation in any direction from the null hypothesis that μ1 = μ2 = μ3 = μ4 will cause the value of F to increase. Thus we are only interested in the upper tail of the F distribution.

Derivation of Deviation Formulae for Computing ANOVA Statistics

Let's do the ANOVA again using different formulae. Let's start by computing the total sum of squares (SSTOT) and then partition it into treatment (SSA) and error (SSE) components. We shall derive formulae for the ANOVA from its model. If we assume that the error component is normally distributed and independent of the IV, we can derive formulas for the ANOVA from this model. First we substitute sample statistics for the parameters in the model:

Yij = GM + (Mj - GM) + (Yij - Mj)

Yij is the score of subject number i in group number j, GM is the grand mean (the mean of all scores in all groups), and Mj is the mean of the scores in the group ( j ) in which Yij is. Now we subtract GM from each side, obtaining:

(Yij - GM) = (Mj - GM) + (Yij - Mj)

Next, we square both sides of the expression, obtaining:

(Yij - GM)² = (Mj - GM)² + (Yij - Mj)² + 2(Mj - GM)(Yij - Mj)

Now, summing across subjects ( i ) and groups ( j ):

Σij (Yij - GM)² = Σij (Mj - GM)² + Σij (Yij - Mj)² + 2 Σij (Mj - GM)(Yij - Mj)

Now, since the sum of the deviations of scores about their mean is always zero, 2 Σij (Mj - GM)(Yij - Mj) equals zero, and thus drops out, leaving us with:

Σij (Yij - GM)² = Σij (Mj - GM)² + Σij (Yij - Mj)²

Within each group (Mj - GM) is the same for every Yij, so Σij (Mj - GM)² equals Σj [nj (Mj - GM)²], leaving us with

Σij (Yij - GM)² = Σj [nj (Mj - GM)²] + Σij (Yij - Mj)²

Thus, we have partitioned the leftmost term (SSTOT) into SSA (the middle term) and SSE (the rightmost term).

SSTOT = Σ (Yij - GM)². For our data, SSTOT = (1 - 5)² + (2 - 5)² + ... + (9 - 5)² = 138.

To get SSA, the among-groups or treatment sum of squares, for each score subtract the grand mean from the mean of the group in which the score is. Then square each of these deviations and sum them. Since the squared deviation for group mean minus grand mean is the same for every score within any one group, we can save time by computing SSA as SSA = Σj [nj (Mj - GM)²]. [Note that each group's contribution to SSA is weighted by its n, so groups with larger n's have more influence. This is a weighted means ANOVA. If we wanted an unweighted means (equally weighted means) ANOVA we could use a harmonic mean, nh = k / Σj(1/nj), in place of nj (or just be sure we have equal n's, in which case the weighted means analysis is an equally weighted means analysis). With unequal n's and an equally weighted analysis, SSTOT ≠ SSA + SSE.]

Given equal sample sizes (or use of the harmonic mean nh), the formula for SSA simplifies to SSA = n Σ (Mj - GM)². For our data, SSA = 5[(2 - 5)² + (3 - 5)² + (7 - 5)² + (8 - 5)²] = 130.

The error sum of squares is SSE = Σ (Yij - Mj)². These error deviations are all computed within treatment groups, so they reflect variance not due to the IV, that is, error. Since every subject within any one treatment group received the same treatment, variance within groups must be due to things other than the IV. For our data, SSE = (1 - 2)² + (2 - 2)² + ... + (9 - 8)² = 8. Note that SSA + SSE = SSTOT.
Also note that for each SS we summed, across all N scores, the squared deviations between either Yij or Mj and either Mj or GM. If we now divide SSA by its df and SSE by its df, we get the same mean squares we earlier obtained.

Computational Formulae for ANOVA

Unless group and grand means are nice small integers, as was the case with our contrived data, the above method (deviation formulae) is unwieldy. It is, however, easier to see what is going on in ANOVA with that method than with the computational method I am about to show you. Use the following computational formulae to do ANOVA on a more typical data set. In these formulae, G stands for the total sum of scores for all N subjects and Tj stands for the sum of scores for treatment group number j.

SSTOT = ΣY² - G²/N

SSA = Σ(Tj²/nj) - G²/N, which simplifies to SSA = ΣTj²/n - G²/N when sample size is constant across groups.

SSE = SSTOT - SSA.

For our sample data,

SSTOT = (1 + 4 + 4 + ... + 81) - (1 + 2 + 2 + ... + 9)²/N = 638 - (100)²/20 = 138

SSA = [(1+2+2+2+3)² + (2+3+3+3+4)² + (6+7+7+7+8)² + (7+8+8+8+9)²]/5 - (100)²/20 = 130

SSE = 138 - 130 = 8.

ANOVA Source Table and APA-Style Summary Statement

Summarizing the ANOVA in a source table:

Source             SS    df    MS      F
Teaching Method   130     3    43.33   86.66
Error               8    16     0.50
Total             138    19

In an APA journal the results of this analysis would be summarized this way: Teaching method significantly affected test scores, F(3, 16) = 86.66, MSE = 0.50, p < .001, ω² = .93. If the researcher had a means of computing the exact significance level, that would be reported. For example, one might report p = .036 rather than p < .05 or .01 < p < .05. One would also typically refer to a table or figure with basic descriptive statistics (group means, sample sizes, and standard deviations) and would conduct some additional analyses (like the pairwise comparisons we shall study in our next lesson). If you are confident that the population variances are homogeneous, and have reported the MSE (which is an estimate of the population variances), then reporting group standard deviations is optional.

Violations of Assumptions

You should use boxplots, histograms, comparisons of mean to median, and/or measures of skewness and kurtosis (available in SAS, the Statistical Analysis System, a delightful computer package) on the scores within each group to evaluate the normality assumption and to identify outliers that should be investigated (and maybe deleted, if you are willing to revise the population to which you will generalize your results, or if they represent errors in data entry, measurement, etc.). If the normality assumption is not tenable, you may want to transform scores or use a nonparametric analysis. If the sample data indicate that the populations are symmetric, or slightly skewed but all in the same direction, the ANOVA should be sufficiently robust to handle the departure from normality.

You should also compute Fmax, the ratio of the largest within-group variance to the smallest within-group variance. If Fmax is less than 4 or 5 (especially with equal or nearly equal sample sizes and normal or nearly normal within-group distributions), then the ANOVA should be sufficiently robust to handle the departure from homogeneity of variance. If not, you may wish to try data transformations or a nonparametric test, keeping in mind that if the populations cannot be assumed to have identical shapes and dispersions, rejection of the nonparametric null hypothesis cannot be interpreted as meaning the populations differ in location.
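If you want to verify the source table (and screen the homogeneity of variance assumption just discussed) with software, a sketch like the following will do it in SAS. The data set name and my choice of HOVTEST options are illustrative assumptions, not prescriptions.

data teach;
   input group $ score @@;              * the contrived teaching-method scores ;
   datalines;
A 1 A 2 A 2 A 2 A 3  B 2 B 3 B 3 B 3 B 4
C 6 C 7 C 7 C 7 C 8  D 7 D 8 D 8 D 8 D 9
;
proc glm data=teach;
   class group;
   model score = group;                 * one-way ANOVA: F(3, 16) = 86.66 ;
   means group / hovtest=levene welch;  * Levene test of the variances, plus the Welch F as a check ;
run;
quit;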
It is possible to adjust the df to correct for heterogeneity of variance, as we did with the separate-variances t test. Box has shown that the true critical F under heterogeneity of variance is somewhere between the critical F on 1, (n - 1) df and the unadjusted critical F on (k - 1), k(n - 1) df, where n = the number of scores in each group (equal sample sizes). It might be appropriate to use a harmonic mean, nh = k / Σj(1/nj), with unequal sample sizes (consult Box - the reference is in Howell). If your F is significant on 1, (n - 1) df, it is significant at whatever the actual adjusted df are. If it is not significant on (k - 1), k(n - 1) df, it is not significant at the actual adjusted df. If it is significant on (k - 1), k(n - 1) df but not on 1, (n - 1) df, you don't know whether or not it is significant with the true adjusted critical F. If you cannot reach an unambiguous conclusion using Box's range for the adjusted critical F, you may need to resort to Welch's test, explained in our textbook (and I have an example below). You must compute for each group W, the ratio of sample size to sample variance. Then you compute an adjusted grand mean, an adjusted F, and adjusted denominator df.

You may prefer to try to meet the assumptions by employing nonlinear transformations of the data prior to analysis. Here are some suggestions:

When the group standard deviations appear to be a linear function of the group means (try correlating the means with the standard deviations or plotting one against the other), a logarithmic transformation should reduce the resulting heterogeneity of variance. Such a transformation will also reduce positive skewness, since the log transformation reduces large scores more than small scores. If you have negative scores or scores near zero, you will need to add a constant (so that all scores are 1 or more) before taking the log, since logs of numbers of zero or less are undefined.

If group means are a linear function of group variances (plot one against the other or correlate them), a square-root transformation might do the trick. This will also reduce positive skewness, since large scores are reduced more than small scores. Again, you may need to first add a constant, c, or use √X + √(X + c), to avoid imaginary numbers like √-1.

A reciprocal transformation, T = 1/Y or T = -1/Y, will very greatly reduce large positive outliers, a common problem with some types of data, such as running times in mazes or reaction times.

If you have negative skewness in your data, you may first reflect the variable to convert negative skewness to positive skewness and then apply one of the transformations that reduce positive skewness. For example, suppose you have a variable on a scale of 1 to 9 which is negatively skewed. Reflect the variable by subtracting each score from 10 (so that 9's become 1's, 8's become 2's, 7's become 3's, 6's become 4's, 4's become 6's, 3's become 7's, 2's become 8's, and 1's become 9's). Then see which of the above transformations does the best job of normalizing the data. Do be careful when it comes time to interpret your results - if the original scale was 1 = complete agreement with a statement and 9 = complete disagreement with the statement, after reflection high scores indicate agreement and low scores indicate disagreement. This is also true of the reciprocal transformation 1/Y (but not -1/Y). For more information on the use of data transformation to reduce skewness, see my documents Using SAS to Screen Data and Using SPSS to Screen Data.
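Here is a minimal SAS sketch of the transformations just described. I am assuming a data set named skewed containing a variable Y that may include zeros or a negatively skewed 1-to-9 scale; those names, and the constants added, are illustrative only.

data transformed;
   set skewed;
   logY   = log10(Y + 1);     * log transformation, with a constant so all scores are 1 or more ;
   sqrtY  = sqrt(Y + 1);      * square-root transformation, again with a constant ;
   recipY = -1 / (Y + 1);     * reciprocal transformation; -1/Y keeps the original ordering of scores ;
   reflY  = 10 - Y;           * reflection of a negatively skewed 1-to-9 scale ;
run;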
Where Y is a proportion, p (for example, the proportion of items correct on a test), variances (npq, binomial) will be smaller in groups where mean p is low or high than in groups where mean p is close to .5. An arcsine transformation, T = 2·arcsin(√Y), may help. It may also normalize by stretching out both tails relative to the middle of the distribution.

Another option is to trim the samples. That is, throw out the extreme X% of the scores in each tail of each group. This may stabilize variances and reduce kurtosis in heavy-tailed distributions. A related approach is to use Winsorized samples, where all of the scores in the extreme X% of each tail of each sample are replaced with the value of the most extreme score remaining after trimming. The modified scores are used in computing means, variances, and test statistics (such as F or t), but should not be counted in n when finding error df for F, t, s², etc. Howell suggests using the scores from trimmed samples for calculating sample means and MSA, but Winsorized samples for calculating sample variances and MSE.

If you have used a nonlinear transformation such as log or square root, it is usually best to report sample means and standard deviations this way: find the sample means and standard deviations on the transformed data and then reverse the transformation to obtain the statistics you report. For example, if you used a log transformation, find the mean and sd of the log-transformed data and then take the antilog (INV LOG on most calculators) of those statistics. For square-root-transformed data, find the square of the mean and sd, etc. These will generally not be the same as the mean and sd of the untransformed data.

How do you choose a transformation? I usually try several transformations and then evaluate the resulting distributions to determine which best normalizes the data and stabilizes the variances. It is not, however, proper to try many transformations and choose the one that gives you the lowest significance level - to do so inflates alpha. Choose your transformation prior to computing F or t. Do check for adverse effects of transformation. For example, a transformation that normalizes the data may produce heterogeneity of variance, in which case you might need to conduct a Welch test on the transformed data. If the sample distributions have different shapes, a transformation that normalizes the data in one group may change the data in another group from nearly normal to negatively skewed or otherwise nonnormal.

Some people get very upset about using nonlinear transformations. If they think that their untransformed measurements are interval-scale data, linear transformations of the true scores, they delight in knowing that their computed t's or F's are exactly the same that would be obtained if they had computed them on God-given true scores. But if a nonlinear transformation is applied, the transformed data are only ordinal scale. Well, keep in mind that Fechner and Stevens (psychophysical laws) have shown us that our senses also provide only ordinal data, positive monotonic (but usually not linear) transformations of the physical magnitudes that constitute one reality. Can we expect more of our statistics than of our senses? I prefer to simply generalize my findings to that abstract reality which is a linear transformation of my (sensory or statistical) data, and I shall continue to do so until I get a hot-line to God from whom I can obtain the truth with no distortion.
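Two more small sketches related to the points above: applying the arcsine transformation to a proportion, and back-transforming the mean of logged scores for reporting. ARSIN is SAS's inverse-sine function; the data set props, the variable p, and the illustrative value of the log mean are my own assumptions.

data arcsined;
   set props;
   T = 2 * arsin(sqrt(p));        * arcsine transformation of a proportion p ;
run;

data report;
   mean_log = 0.60;               * suppose this were the mean of log10-transformed scores ;
   back_mean = 10**mean_log;      * antilog: the mean to report on the original scale ;
run;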
Do keep in mind that one additional nonlinear transformation available is to rank the data and then conduct the analysis on the ranks. This is what is done in most nonparametric procedures, and they typically have simplified formulas (using the fact that the sum of the integers from 1 to n equals n(n + 1)/2) with which one can calculate the test statistic.

Computing ANOVA Statistics From Group Means and Variances with Unequal Sample Sizes and Heterogeneity of Variance

Wilbur Castellow (while he was chairman of our department) wanted to evaluate the effect of a series of changes he made in his introductory psychology class upon student ratings of instructional excellence. Institutional Research would not provide the raw data, so all we had were the following statistics:

Semester    Mean   SD     N    pj
Spring 89   4.85   .360   34   34/133 = .2556
Fall 88     4.61   .715   31   31/133 = .2331
Fall 87     4.61   .688   36   36/133 = .2707
Spring 87   4.38   .793   32   32/133 = .2406

1. Compute a weighted mean of the k sample variances. For each sample the weight is pj = nj / N. MSE = Σ pj sj² = .2556(.360)² + .2331(.715)² + .2707(.688)² + .2406(.793)² = .4317.

2. Obtain the among-groups SS, Σ nj(Mj - GM)². The GM = Σ pj Mj = .2556(4.85) + .2331(4.61) + .2707(4.61) + .2406(4.38) = 4.616. Among-groups SS = 34(4.85 - 4.616)² + 31(4.61 - 4.616)² + 36(4.61 - 4.616)² + 32(4.38 - 4.616)² = 3.646. With 3 df, MSA = 1.215, and F(3, 129) = 2.814, p = .042.

3. Before you get excited about this significant result, notice that the sample variances are not homogeneous. There is a negative correlation between sample mean and sample variance, due to a ceiling effect as the mean approaches its upper limit, 5. The ratio of the largest to the smallest variance is .793²/.360² = 4.852, which is significant beyond the .01 level with Hartley's maximum F-ratio statistic (a method for testing the null hypothesis that the variances are homogeneous). Although the sample sizes are close enough to equal that we might not worry about violating the homogeneity of variance assumption, for instructional purposes let us make some corrections for the heterogeneity of variance.

4. Box (1954, see our textbook) tells us the critical (.05) value for our F on this problem is somewhere between F(1, 30) = 4.17 and F(3, 129) = 2.675. Unfortunately our F falls in that range, so we don't know whether or not it is significant.

5. The Welch procedure (see the formulae in our textbook) is now our last resort, since we cannot transform the raw data (which we do not have). Wj = nj/sj²: W1 = 34/.360² = 262.35, W2 = 31/.715² = 60.638, W3 = 36/.688² = 76.055, and W4 = 32/.793² = 50.887. The sum of the weights is 449.93. The weighted (adjusted) grand mean is

X̄' = [262.35(4.85) + 60.638(4.61) + 76.055(4.61) + 50.887(4.38)] / 449.93 = 2125.449 / 449.93 = 4.724.

The numerator of F'' = [262.35(4.85 - 4.724)² + 60.638(4.61 - 4.724)² + 76.055(4.61 - 4.724)² + 50.887(4.38 - 4.724)²] / 3 = 3.988.

The denominator of F'' = 1 + [2(k - 2)/(k² - 1)] Σ [(1 - Wj/ΣW)² / (nj - 1)] = 1 + (4/15)[(1 - 262.35/449.93)²/33 + (1 - 60.638/449.93)²/30 + (1 - 76.055/449.93)²/35 + (1 - 50.887/449.93)²/31] = 1 + (4/15)(.07532) = 1.020.

Thus, F'' = 3.988 / 1.020 = 3.910. Note that this F'' is greater than our standard F. Why? Well, notice that each group's contribution to the numerator is inversely related to its variance, thus increasing the contribution of Group 1, which had a mean far from the grand mean and a small variance. We are not done yet; we still need to compute the adjusted error degrees of freedom: df' = (k² - 1) / [3 Σ (1 - Wj/ΣW)²/(nj - 1)] = 15 / [3(.07532)] = 66.38. Thus, F(3, 66) = 3.910, p = .012.

Directional Hypotheses

I have never seen published research where the authors used ANOVA and employed a directional test, but it is possible. Suppose you were testing the following directional hypotheses:

H0: The classification variable is not related to the outcome variable in the way specified in the alternative hypothesis
H1: μ1 > μ2 > μ3

The one-tailed p value that you obtain with the traditional F test tells you the probability of getting sample means as (or more) different from one another, in any order, as were those you obtained, were the truth that the population means are identical. Were the null true, the probability of your correctly predicting the order of the differences in the sample means is 1/k!, where k is the number of groups. By application of the multiplication rule of probability, the probability of your getting sample means as different from one another as they were, and in the order you predicted, is the one-tailed p divided by k!. If k is three, you take the one-tailed p and divide by 3 × 2 = 6 - a one-sixth tailed test. I know, that sounds strange. Lots of luck convincing the reviewers of your manuscript that you actually PREdicted the order of the means. They will think that you POSTdicted them.

Fixed vs. Random vs. Mixed Effects ANOVA

As in correlation/regression analysis, the IV in ANOVA may be fixed or random. If it is fixed, the researcher has arbitrarily (based on e's opinion, judgement, or prejudice) chosen k values of the IV. E will restrict e's generalization of the results to those k values of the IV. E has defined the population of IV values in which e is interested as consisting of only those values e actually used; thus, e has used the entire population of IV values. For example, I give subjects 0, 1, or 3 beers and measure reaction time. I can draw conclusions about the effects of 0, 1, or 3 beers, but not about 2 beers, 4 beers, 10 beers, etc.

With a random-effects IV, one randomly obtains levels of the IV, so the actual levels used would not be the same if you repeated the experiment. For example, I decide to study the effect of dose of phenylpropanolamine upon reaction time. I have my computer randomly select ten dosages from a uniform distribution of dosages from zero to 100 units of the drug. I then administer those 10 dosages to my subjects, collect the data, and do the analyses. I may generalize across the entire range of values (doses) from which I randomly selected my 10 values, even (by interpolation or extrapolation) to values other than the 10 I actually employed.

In a factorial ANOVA, one with more than one IV, you may have a mixed-effects ANOVA - one where one or more IVs are fixed and one or more are random. Statistically, our one-way ANOVA does actually have two IVs, but one is sort of hidden. The hidden IV is SUBJECTS. Does who the subject is affect the score on the DV? Of course it does, but we count such effects as error variance in the one-way independent samples ANOVA.
Subjects is a random-effects variable, or at least we pretend it is, since we randomly selected subjects from the population of persons (or other things) to which we wish to generalize our results. In fact, if there is not at least one random-effects IV in your research, you don't need ANOVA or any other inferential statistic. If all of your IVs are fixed, your data represent the entire population, not a random sample therefrom, so your descriptive statistics are parameters and you need not infer what you already know for sure.

ANOVA as a Regression Analysis: Eta-Squared and Omega-Squared

The ANOVA is really just a special case of a regression analysis. It can be represented as a multiple regression analysis, with one dichotomous "dummy variable" for each treatment degree of freedom (more on this in another lesson). It can also be represented as a bivariate, curvilinear regression. Here is a scatter plot for our ANOVA data. Since the numbers used to code our groups are arbitrary (the independent variable being qualitative), I elected to use the number 1 for Group A, 2 for Group D, 3 for Group C, and 4 for Group B. Note that I have used blue squares to plot the points with a frequency of three and red triangles to plot those with a frequency of one. The blue squares are also the group means. I have placed on the plot the linear regression line predicting score from group. The regression falls far short of significance, with the SSRegression being only 1, for an r² of 1/138 = .007.

We could improve the fit of our regression line to the data by removing the restriction that it be a straight line, that is, by doing a curvilinear regression. A quadratic regression line is based on a polynomial where the predictors are Group and Group-squared, that is, Ŷ = a + b1X + b2X² - more on this when we cover trend analysis. A quadratic function allows us one bend in the curve. Here is a plot of our data with a quadratic regression line.

Eta-squared (η²) is a curvilinear correlation coefficient. To compute it, one first uses a curvilinear equation to predict values of Y|X. You then compute the SSError as the sum of squared residuals between actual Y and predicted Y, that is, SSError = Σ(Y - Ŷ)². As usual, SSTotal = Σ(Y - GM)², where GM is the grand mean, the mean of all scores in all groups. The SSRegression is then SSTotal - SSError. Eta-squared is then SSRegression / SSTotal, the proportion of the SSTotal that is due to the curvilinear association with X. For our quadratic regression (which is highly significant), SSRegression = 126, η² = .913.

We could improve the fit a bit more by going to a cubic polynomial model (which adds Group-cubed to the quadratic model, allowing a second bending of the curve). Here is our scatter plot with the cubic regression line. Note that the regression line runs through all of the group means. This will always be the case when we have used a polynomial model of order = K - 1, where K = the number of levels of our independent variable. A cubic model has order = 3, since it includes three powers of the independent variable (Group, Group-squared, and Group-cubed). The SSRegression for the cubic model is 130, η² = .942. Please note that this SSRegression is exactly the same as that we computed earlier as the ANOVA SSAmong Groups. We have demonstrated that a polynomial regression with order = K - 1 is identical to the traditional one-way ANOVA. Take a look at my document T = ANOVA = Regression.
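The cubic-regression demonstration above is easy to reproduce. The sketch below codes the groups 1 = A, 2 = D, 3 = C, 4 = B (the same arbitrary coding used for the plots) and fits the cubic polynomial; the data set and variable names are my own.

data poly;
   input g score @@;
   g2 = g*g;                   * Group-squared ;
   g3 = g**3;                  * Group-cubed ;
   datalines;
1 1 1 2 1 2 1 2 1 3  4 2 4 3 4 3 4 3 4 4
3 6 3 7 3 7 3 7 3 8  2 7 2 8 2 8 2 8 2 9
;
proc reg data=poly;
   model score = g g2 g3;      * cubic model: the model SS = 130, the ANOVA SS among groups ;
run;
quit;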
Strength of Effect Estimates - Proportions of Variance Explained

We can employ η² as a measure of the magnitude of the effect of our ANOVA independent variable without doing the polynomial regression. We simply find SSAmong Groups / SSTotal from our ANOVA source table. This provides a fine measure of the strength of effect of our independent variable in our sample data, but it generally overestimates the population η². My programs Conf-Interval-R2-Regr.sas and CI-R2-SPSS.zip will compute an exact confidence interval about eta-squared. For our data η² = 130/138 = .94. A 95% confidence interval for the population parameter extends from .84 to .96. It might be better to report a 90% confidence interval here - more on that soon.

One well-known alternative is omega-squared, ω², which estimates the proportion of the variance in Y in the population which is due to variance in X:

ω² = [SSAmong - (K - 1)·MSError] / (SSTotal + MSError).

For our data, ω² = [130 - 3(.5)] / (138 + .5) = .93.

Benchmarks for η²:
- .01 (1%) is small but not trivial
- .06 is medium
- .14 is large

A Word of Caution. Rosenthal has found that most psychologists misinterpret strength of effect estimates such as r² and ω². Rosenthal (1990, American Psychologist, 45, 775-777) used an example where a treatment (a small daily dose of aspirin) lowered patients' death rate so much that the researchers halted the research prematurely and told the participants who were in the control condition to start taking a baby aspirin every day. So, how large was the effect of the baby aspirin? As an odds ratio it was 1.83 - that is, the odds of a heart attack were 1.83 times higher in the placebo group than in the aspirin group. As a proportion of variance explained, the effect size was .0011 (about one tenth of one percent). One solution that has been proposed for dealing with r²-like statistics is to report their square root instead. For the aspirin study, we would report r = .033 (but that still sounds small to me).

Also, keep in mind that anything that artificially lowers error variance, such as using homogeneous subjects and highly controlled laboratory conditions, artificially inflates r², ω², etc. Thus, under highly controlled conditions, one can obtain a very high ω² even if outside the laboratory the IV accounts for almost none of the variance in the DV. In the field, those variables held constant in the lab may account for almost all of the variance in the DV.

What Confidence Coefficient Should I Employ for η² and RMSSE? If you want the confidence interval to be equivalent to the ANOVA F test of the effect (which employs a one-tailed, upper-tailed probability), you should employ a confidence coefficient of (1 - 2α). For example, for the usual .05 criterion of statistical significance, use a 90% confidence interval, not 95%. Please see my document Confidence Intervals for Squared Effect Size Estimates in ANOVA: What Confidence Coefficient Should be Employed?

Strength of Effect Estimates - Standardized Differences Among Means

When dealing with differences between or among group means, I generally prefer strength of effect estimators that rely on the standardized difference between means (rather than proportions of variance explained). We have already seen such estimators when we studied two-group designs (Hedges' g), but how can we apply this approach when we have more than two groups?
My favorite answer to this question is that you should just report estimates of Cohen's d for those contrasts (differences between means or sets of means) that are of most interest - that is, which are most relevant to the research questions you wish to address. Of course, I am also of the opinion that we would often be better served by dispensing with the ANOVA in the first place and proceeding directly to making those contrasts of interest without doing the ANOVA. There is, however, another interesting suggestion. We could estimate the average value of Cohen's d for the groups in our research. There are several ways we could do this. We could, for example, estimate d for every pair of means, take the absolute values of those estimates, and then average them. James H. Steiger (2004: Psychological Methods, 9, 164-182) has proposed the use of the RMSSE (root mean square standardized effect) in situations like this. Here is how the RMSSE is calculated:

RMSSE = √{ [1/(k - 1)] Σj [(Mj - GM)/√MSE]² },

where k is the number of groups, Mj is a group mean, GM is the overall (grand) mean, and the standardizer is the pooled standard deviation, the square root of the within-groups mean square, MSE (note that we are assuming homogeneity of variances). Basically what we are doing here is averaging the values of (Mj - GM)/SD, having squared them first (to avoid them summing to zero), dividing by the among-groups degrees of freedom (k - 1) rather than k, and then taking the square root to get back to un-squared (standard deviation) units. Since the standardizer (√MSE) is constant across groups, we can simplify the expression above to

RMSSE = √{ [1/(k - 1)] Σj (Mj - GM)² / MSE }.

For our original set of data, the sum of the squared deviations between group means and grand mean is (2 - 5)² + (3 - 5)² + (7 - 5)² + (8 - 5)² = 26. Notice that this is simply the among-groups sum of squares (130) divided by n (5). Accordingly, RMSSE = √[(1/3)(26/.5)] = 4.16, a Godzilla-sized average standardized difference between group means.

We can place a confidence interval about our estimate of the average standardized difference between group means. To do so we shall need the NDC program from Steiger's page at http://www.statpower.net/Content/NDC/NDC.exe . Download and run that exe. Ask for a 90% CI and give the values of F and df, then click COMPUTE. You are given the CI for lambda, the noncentrality parameter: from 120.6998 to 436.3431. Now we transform this confidence interval to a confidence interval for the RMSSE with the following transformation (applied to each end of the CI): RMSSE = √[λ / (n(k - 1))]. For the lower boundary, this yields √[120.6998 / (5(3))] = 2.837, and for the upper boundary √[436.3431 / (5(3))] = 5.393. That is, our estimate of the effect size is between King Kong-sized and beyond Godzilla-sized.

Steiger noted that a test of the null hypothesis that Ψ (the parameter estimated by the RMSSE) = 0 is equivalent to the standard ANOVA F test if the confidence interval is constructed with 100(1 - 2α)% confidence. For example, if the ANOVA were conducted with .05 as the criterion of statistical significance, then an equivalent confidence interval for Ψ should be at 90% confidence - Ψ cannot be negative, after all. If the 90% confidence interval for Ψ includes 0, then the ANOVA F falls short of significance; if it excludes 0, then the ANOVA F is significant.

Power Analysis

One-way ANOVA power analysis is detailed in our textbook. The effect size may be specified in terms of φ' = √[ Στj² / (k·σ²error) ], where τj = μj - μ. Cohen used the symbol f for this same statistic, and considered an f of .10 to represent a small effect, .25 a medium effect, and .40 a large effect. In terms of percentage of variance explained (η²), small is 1%, medium is 6%, and large is 14%.

For example, suppose that I wish to test the null hypothesis that for GRE-Q, the population means for undergraduates intending to major in social psychology, clinical psychology, and experimental psychology are all equal. I decide that the minimum nontrivial effect size is if each mean differs from the next by 20 points (about 1/5 σ). For example, means of 480, 500, and 520. The Στj² is then 20² + 0² + 20² = 800. Next we compute φ'. Assuming that σ is about 100, φ' = √[800 / (3·10000)] = 0.163. Suppose we have 11 subjects in each group. φ = φ'·√n = .163·√11 = .54. Treatment df = 2, error df = 3(11 - 1) = 30. From the noncentral F table in our textbook, for φ = .50, dft = 2, dfe = 30, α = .05, β = 90%, thus power = 10%.

How many subjects would be needed to raise power to 70%? β = .30. Go to the table, assuming that you will need enough subjects so that dfe = infinity. For β = .30, φ = 1.6. Now, n = (φ²)(k)(σe²) / Στj² = (1.6)²(3)(100)² / 800 = 96. Now, 96 subjects per group would give you, practically speaking, infinite df. If N came out so low that dfe < 30, you would re-do the analysis with a downwards-adjusted dfe.

One can define an effect size in terms of η². For example, if η² = 10%, then φ' = √[η² / (1 - η²)] = √(.10/.90) = .33. Suppose I had 6 subjects in each of four groups. If I employed an alpha criterion of .05, how large (in terms of the % of variance in the DV accounted for by variance in the IV) would the effect need to be for me to have a 90% chance of rejecting the null hypothesis? From the table, for dft = 3, dfe = 20, φ = 2.0 for β = .13, and φ = 2.2 for β = .07. By linear interpolation, for β = .10, φ = 2.0 + (3/6)(.2) = 2.1. φ' = φ/√n = 2.1/√6 = .857. η² = φ'² / (1 + φ'²) = .857² / (1 + .857²) = 0.42, a very large effect!
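If you would rather not interpolate in the noncentral F table, SAS's PROC POWER will do this sort of power analysis directly. Here is a sketch for the GRE-Q example (hypothesized means 480, 500, and 520, with σ = 100); the two solve-for setups are my own illustrations, and the handout's own recommendation (G*Power) appears below.

proc power;                        /* solve for power with 11 subjects per group */
   onewayanova
      groupmeans = 480 | 500 | 520
      stddev     = 100
      npergroup  = 11
      power      = .;
run;

proc power;                        /* solve for the n per group needed for 70% power */
   onewayanova
      groupmeans = 480 | 500 | 520
      stddev     = 100
      power      = .70
      npergroup  = .;
run;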
Do note that this method of power analysis does not ignore the effect of error df, as did the methods employed in Chapter 8. If you were doing small sample power analyses for independent t-tests, you should use the methods shown here (with k = 2), which will give the correct power figures (since F t = , ts power must be the same as Fs). Make it easy on yourself. Use G*Power to do the power analysis. APA-Style Summary Statement Teaching method significantly affected the students test scores, F(3, 16) = 86.66, MSE = 0.50, p < .001, q2 = .942, 95% CI [.858, .956]. As shown in Table 1, . Copyright 2012, Karl L. Wuensch - All rights reserved. CI-Eta2-Alpha Confidence Intervals for Squared Effect Size Estimates in ANOVA: What Confidence Coefficient Should be Employed? If you want the confidence interval to be equivalent to the ANOVA F test of the effect (which employs a one-tailed, upper tailed, probability) you should employ a confidence coefficient of (1 - 2). For example, for the usual .05 criterion of statistical significance, use a 90% confidence interval, not 95%. This is illustrated below. A two-way independent samples ANOVA was conducted and produced this output: Dependent Variable: PulseIncrease Sum of Source DF Squares Mean Square F Value Pr > F Model 3 355.95683 118.65228 3.15 0.0249 Error 380 14295.21251 37.61898 Corrected Total 383 14651.16933 R-Square Coeff Var Root MSE pulse Mean 0.024295 190.8744 6.133431 3.213333 Source DF Anova SS Mean Square F Value Pr > F Gender 1 186.0937042 186.0937042 4.95 0.0267 Image 1 63.6027042 63.6027042 1.69 0.1943 Gender*Image 1 106.2604167 106.2604167 2.82 0.0936 Eta-square and a corresponding 95% Confidence Interval will be computed for each effect. To put a confidence interval on the 2 we need to compute an adjusted F. To adjust the F we first compute an adjusted error term. For the main effect of gender, 867 . 371 38309 . 186 14651===Effect TotalEffect Totaldf dfSS SSMSE . In effect we are putting back into the error term all of the variance accounted for by other effects in our model. Now the adjusted F(1, 382) = 914 . 4867 . 3709 . 186= =GenderGenderMSEMS. For main effects, one can also get the adjusted F by simply doing a one way ANOVA with only the main effect of interest in the model: 2 proc ANOVA data=Katie; class Gender; model PulseIncrease = Gender; Dependent Variable: PulseIncrease Sum of Source DF Squares Mean Square F Value Pr > F Model 1 186.09370 186.09370 4.91 0.0272 Error 382 14465.07563 37.86669 Corrected Total 383 14651.16933 R-Square Coeff Var Root MSE PulseIncrease Mean 0.012702 191.5018 6.153592 3.213333 Source DF Anova SS Mean Square F Value Pr > F Gender 1 186.0937042 186.0937042 4.91 0.0272 Now use this adjusted F with the SAS or SPSS program for putting a confidence interval on R2. 
DATA ETA; ************************************************************************************************************** Construct Confidence Interval for Eta-Squared **************************************************************************************************************; F= 4.914 ; df_num = 1 ; df_den = 382; ncp_lower = MAX(0,fnonct (F,df_num,df_den,.975)); ncp_upper = MAX(0,fnonct (F,df_num,df_den,.025)); eta_squared = df_num*F/(df_den + df_num*F); eta2_lower = ncp_lower / (ncp_lower + df_num + df_den + 1); eta2_upper = ncp_upper / (ncp_upper + df_num + df_den + 1); output; run; proc print; var eta_squared eta2_lower eta2_upper; title 'Confidence Interval on Eta-Squared'; run; ------------------------------------------------------------------------------------------------- Confidence Interval on Eta-Squared eta_ eta2_ eta2_ Obs squared lower upper 1 0.012700 0 0.043552 SASLOG NOTE: Invalid argument to function FNONCT at line 57 column 19. F=4.914 df_num=1 df_den=382 ncp_lower=0 ncp_upper=17.485492855 eta_squared=0.0127004968 eta2_lower=0 eta2_upper=0.0435519917 _ERROR_=1 _N_=1 NOTE: Mathematical operations could not be performed at the following places. The results of the 3 operations have been set to missing values. Each place is given by: (Number of times) at (Line):(Column). Do not be concerned about this note. You will get every time your CI includes zero -- the iterative procedure bumps up against the wall at value = 0. Notice that the confidence interval includes the value 0 even though the effect of gender is significant at the .027 level. What is going on here? I think the answer can be found in Steiger (2004). Example 10: Consider a test of the hypothesis that = 0, that is, that the RMSSE (as defined in Equation 12) in an ANOVA is zero. This hypothesis test is one-sided because the RMSSE cannot be negative. To use a two-sided confidence interval to test this hypothesis at the = .05 significance level, one should examine the 100(1 - 2)% = 90% confidence interval for . If the confidence interval excludes zero, the null hypothesis will be rejected. This hypothesis test is equivalent to the standard ANOVA F test. Well, R2 (and 2) cannot be less than zero either. Accordingly, one can argue that when putting a CI on an ANOVA effect that has been tested with the traditional .05 criterion of significance, that CI should be a 90% CI, not a 95% CI. ncp_lower = MAX(0,fnonct (F,df_num,df_den,.95)); ncp_upper = MAX(0,fnonct (F,df_num,df_den,.05)); ------------------------------------------------------------------------------------------------ Confidence Interval on Eta-Squared eta_ eta2_ Obs squared eta2_lower upper 1 0.012700 .000743843 0.037453 The 90% CI does not include zero. Let us try another case. Suppose you obtained F(2, 97) = 3.09019. The obtained value of F here is exactly equal to the critical value of F for alpha = .05. F= 3.09019 ; df_num = 2 ; df_den = 97; ncp_lower = MAX(0,fnonct (F,df_num,df_den,.95)); 4ncp_upper = MAX(0,fnonct (F,df_num,df_den,.05)); . ------------------------------------------------------------------------------------------------ Confidence Interval on Eta-Squared eta_ eta2_ eta2_ Obs squared lower upper 1 0.059899 2.1519E-8 0.13743 Notice that the 90% CI does exclude zero, but barely. A 95% CI would include zero. Reference Steiger, J. H. (2004). Beyond the F test: Effect size confidence intervals and tests of close fit in the analysis of variance and contrast analysis. Psychological Methods, 9, 164-182, Karl L. Wuensch, Dept. 
of Psychology, East Carolina Univ., Greenville, NC USA September, 2009 Homogeneity of Variance Tests For Two or More Groups We covered this topic for two-group designs earlier. Basically, one transforms the scores so that between groups variance in the scores reflects differences in variance rather than differences in means. Then one does a t test on the transformed scores. If there are three or more groups, simply replace the t test with an ANOVA. See the discussion in the Engineering Statistics Handbook. Levene suggested transforming the scores by subtracting the within-group mean from each score and then either taking the absolute value of each deviation or squaring each deviation. Both versions are available in SAS. Brown and Forsythe recommended using absolute deviations from the median or from a trimmed mean. Their Monte Carlo research indicated that the trimmed mean was the best choice when the populations were heavy in their tails and the median was the best choice when the populations were skewed. The Brown and Forsythe method using the median is available in SAS. It would not be very difficult to program SAS to use the trimmed means. Obriens test is also available in SAS. I provide here SAS code to illustrate homogeneity of variance tests. The data are the gear data from the Engineering Statistics Handbook. options pageno=min nodate formdlim='-'; title 'Homogeneity of Variance Tests'; title2 'See http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm'; run; data Levene; input Batch N; Do I=1 to N; Input GearDiameter @@; output; end; cards; 1 10 1.006 0.996 0.998 1.000 0.992 0.993 1.002 0.999 0.994 1.000 2 10 0.998 1.006 1.000 1.002 0.997 0.998 0.996 1.000 1.006 0.988 3 10 0.991 0.987 0.997 0.999 0.995 0.994 1.000 0.999 0.996 0.996 4 10 1.005 1.002 0.994 1.000 0.995 0.994 0.998 0.996 1.002 0.996 5 10 0.998 0.998 0.982 0.990 1.002 0.984 0.996 0.993 0.980 0.996 6 10 1.009 1.013 1.009 0.997 0.988 1.002 0.995 0.998 0.981 0.996 7 10 0.990 1.004 0.996 1.001 0.998 1.000 1.018 1.010 0.996 1.002 8 10 0.998 1.000 1.006 1.000 1.002 0.996 0.998 0.996 1.002 1.006 9 10 1.002 0.998 0.996 0.995 0.996 1.004 1.004 0.998 0.999 0.991 10 10 0.991 0.995 0.984 0.994 0.997 0.997 0.991 0.998 1.004 0.997 *****************************************************************************; proc GLM data=Levene; class Batch; model GearDiameter = Batch / ss1; means Batch / hovtest=levene hovtest=BF hovtest=obrien; title; run; *****************************************************************************; proc GLM data=Levene; class Batch; model GearDiameter = Batch / ss1; means Batch / hovtest=levene(type=ABS); run; Here are parts of the statistical output, with annotations: Levene's Test for Homogeneity of GearDiameter Variance ANOVA of Squared Deviations from Group Means Sum of Mean Source DF Squares Square F Value Pr > F Batch 9 5.755E-8 6.394E-9 2.50 0.0133 Error 90 2.3E-7 2.556E-9 With the default Levenes test (using squared deviations), the groups differ significantly in variances. O'Brien's Test for Homogeneity of GearDiameter Variance ANOVA of O'Brien's Spread Variable, W = 0.5 Sum of Mean Source DF Squares Square F Value Pr > F Batch 9 7.105E-8 7.894E-9 2.22 0.0279 Error 90 3.205E-7 3.562E-9 Also significant with Obriens Test. Brown and Forsythe's Test for Homogeneity of GearDiameter Variance ANOVA of Absolute Deviations from Group Medians But not significant with the Brown & Forsythe test using absolute deviations from within-group medians. 
Sum of Mean Source DF Squares Square F Value Pr > F Batch 9 0.000227 0.000025 1.71 0.0991 Error 90 0.00133 0.000015 ------------------------------------------------------------------------------------------------- SAS will only let you do one Levene test per invocation of PROC GLM, so I ran GLM a second time to get the Levene test with absolute deviations. As you can see below, the difference in variances is significant with this test. Levene's Test for Homogeneity of GearDiameter Variance ANOVA of Absolute Deviations from Group Means Sum of Mean Source DF Squares Square F Value Pr > F Batch 9 0.000241 0.000027 2.16 0.0322 Error 90 0.00112 0.000012 The One-Way ANOVA procedure in PASW also provides a test of homogeneity of variance, as shown below. Test of Homogeneity of Variances GearDiameter Levene Statistic df1 df2 Sig. 2.159 9 90 .032 Notice that the Levene test provided by PASW is that using absolute deviations from within-group means. The Brown-Forsythe test offered as an option is not their test of equality of variances, it is a robust test of differences among means, like the Welch test. Return to Wuenschs Statistics Lessons Page Karl L. Wuensch May, 2010. Omega-Squared.doc Dear 6430 students, We have discussed omega-squared as a less biased (than is eta-squared) estimate of the proportion of variance explained by the treatment variable in the population from which our sample data could be considered to be random. Earlier this semester we discussed a very similar statistic, r-squared, and I warned you about how this statistic can be inflated by high levels of extraneous variable control. The same caution applies to eta-squared and omega-squared. Here is a comment I posted to EDSTAT-L on this topic a few years back: ------------------------------------------------------------------------------ Date: Mon, 11 Oct 93 11:27:23 EDT From: "Karl L. Wuensch" To: Multiple recipients of list Subject: Omega-squared (was P Value) Josh, [email protected], noted: >We routinely run omega squared on our data. Omega squared is one of the most >frequently applied methods in estimating the proportion of the dependent >variable accounted for by an independent variable, and is used to confirm the >strength of association between variables in a population. ............ Omega-squared can also be misinterpreted. If the treatment is evaluated in circumstances (the laboratory) where the influence of extraneous variables (other variables that influence the dependent variable) is eliminated, then the omega-squared will be inflated relative to the proportion of the variance in the dependent variable due to the treatment in a (real) population where those extraneous variables are not eliminated. Thus, a treatment that really accounts for a trivial amount of the variance in the dependent variable out there in the real world can produce a large omega-squared when computed from data collected in the laboratory. To a great extent both P and omega-squared measure the extent to which the researcher has been able to eliminate "error variance" when collecting the data (but P is also greatly influenced by sample size). Imagine that all your subjects were clones of one another with identical past histories. All are treated in exactly the same way, except that for half of them you clapped your hands in their presence ten minutes before measuring whatever the dependent variable is. 
Because the subjects differ only on whether or not you clapped your hands in their presence, if such clapping has any effect at all, no matter how small, it accounts for 100% of the variance in your sample. If the population to which you wish to generalize your results is not one where most extraneous variance has been eliminated, your omega-squared may be a gross overestimate of the magnitude of the effect. Do note that this problem is not unique to omega-squared. Were you to measure the magnitude of the effect as being the between groups difference in means divided by the within groups standard deviation the same potential for inflation of effect would exist. Karl L. Wuensch, Dept. of Psychology, East Carolina Univ. Greenville, NC 27858-4353, phone 919-757-6800, fax 919-757-6283 Bitnet Address: PSWUENSC@ECUVM1 Internet Address: [email protected] ======================================================================== Sender: [email protected] From: Joe H Ward Karl --- good comment!! My early research days were spent in an R-squared, Omega-squared, Factor Analysis environment. My own observations say: "BEWARE of those correlation-type indicators!!!" --- Joe Joe Ward 167 East Arrowhead Dr. San Antonio, TX 78228-2402 Phone: 210-433-6575 [email protected] MultComp.doc One-Way Multiple Comparisons Tests Error Rates The error rate per comparison, pc , is the probability of making a Type I error on a single comparison, assuming the null hypothesis is true. The error rate per experiment, PE , is the expected number of Type I errors made when making c comparisons, assuming that each of the null hypotheses is true. It is equal to the sum of the per comparison alphas. If the per comparison alphas are constant, then PE = c pc, The familywise error rate, fw , is the probability of making one or more Type I errors in a family of c comparisons, assuming that each of the c null hypotheses is true. If the comparisons are independent of one another (orthogonal), then ( )cpc fw = 1 1 . For our example problem, evaluating four different teaching methods, if we were to compare each treatment mean with each other treatment mean, c would equal 6. If we were to assume those 6 comparisons to be independent of each other (they are not), then 26 . 95 . 16= =fw . Multiple t tests One could just use multiple t-tests to make each comparison desired, but one runs the risk of greatly inflating the familywise error rate (the probability of making one or more Type I errors in a family of c comparisons) when doing so. One may use a series of protected t-tests in this situation. This procedure requires that one first do an omnibus ANOVA involving all k groups. If the omnibus ANOVA is not significant, one stops and no additional comparisons are done. If that ANOVA is significant, one makes all the comparisons e wishes using t-tests. If you have equal sample sizes and homogeneity of variance, you can use nMSEX Xtj i=2, which pools the error variance across all k groups, giving you N - k degrees of freedom. If you have homogeneity of variance but unequal ns use: +=j ij in nMSEX Xt1 1. MSE is the error mean square from the omnibus ANOVA. If you had heterogeneous variances, you would need to compute separate variances t-tests, with adjusted df. The procedure just discussed (protected t-tests) is commonly referred to as Fishers LSD test. LSD stands for Least Significant Difference. If you were making comparisons for several pairs of means, and n was the same in each sample, and you Copyright 2010, Karl L. 
Wuensch - All rights reserved. 2were doing all your work by hand (as opposed to using a computer), you could save yourself by solving substituting the critical value of t in the formula above, entering the n and the MSE, and then solving for the (smallest) value of the difference between means which would be significant (the least significant difference). Then you would not have to compute a t for each comparison, you would just find the difference between the means and compare that to the value of the least significant difference. While this procedure is not recommended in general (it does not adequately control familywise alpha), there is one special case when it is the best available procedure, and that case is when k = 3. In that case Fishers procedure does hold fw at or below the stated rate and has more power than other commonly employed procedures. The interested student is referred to the article A Controlled, Powerful Multiple-Comparison Strategy for Several Situations, by Levin, Serlin, and Seaman (Psychological Bulletin, 1994, 115: 153-159) for details and discussion of how Fishers procedure can be generalized to other 2 df situations. Linear Contrasts One may make simple or complex comparisons involving some or all of a set of k treatment means by using linear contrasts. Suppose I have five means, which I shall label A, B, C, D, and E. I can choose contrast coefficients to compare any one mean or subset of these means with any other mean or subset of these means. The sum of the contrast coefficients must be zero. All of the coefficients applied to the one set of means must be positive, all those applied to the other set must be negative. Means left out of the contrast get zero coefficients. There are some advantages of using a standard set of weights, so I shall do so here: The coefficients for the one set must equal +1 divided by the number of conditions in that set while those for the other set must equal -1 divided by the number of conditions in that other set. The sum of the absolute values of the coefficients must be 2. Suppose I want to contrast combined groups A and B with combined groups C, D, and E. A nonstandard set of coefficients is 3, 3, 2, 2, 2, and a standard set of coefficients would be .5, .5, 1/3, 1/3, 1/3. If I wanted to contrast C with combined D and E , a nonstandard set of coefficients would be 0, 0, -2, 1, 1 and a standard set of coefficients would be 0, 0, 1, .5, .5. A standard contrast is computed as i iM c = . To test the significance of a contrast, compute a contrast sum of squares this way: =jjncSS22 . When the sample sizes are equal, this simplifies to =22jcnSS . Each contrast will have only one treatment df, so the contrast MS is the same as the contrast SS. To get an F for the contrast just divide it by an appropriate MSE (usually that which would be obtained were one to do an omnibus ANOVA on all k treatment groups). For our example problem, suppose we want to compare combined groups C and D with combined groups A and B. The A, B, C, D means are 2, 3, 7, 8, and the 3coefficients are .5, ,5, +.5, +.5. 5 ) 8 ( 5 . ) 7 ( 5 . ) 3 ( 5 . ) 2 ( 5 . = + + = . Note that the value of the contrast is quite simply the difference between the mean of combined groups C and D (7.5) and the mean of combined groups A and B (2.5). 1251) 25 ( 525 . 25 . 25 . 25 .) 5 ( 52 = =+ + +=MS , and F(1, 16) = 125/.5 = 250, p .01 b. B vs C: 66 . 12316 .3 7== q , p < .01 7 c. C vs D: 16 . 3316 .7 8== q , p > .01 6. 
Group:  A   B   C   D
Mean:   2   3   7   8

What if there are unequal sample sizes? One solution is to use the harmonic mean sample size computed across all k groups, that is, ñ = k / Σ(1/n_j). Another solution is to compute, for each comparison made, the harmonic mean sample size of the two groups involved in that comparison, that is, ñ = 2 / (1/n_i + 1/n_j). With the first solution the effect of n (bigger n, more power) is spread out across groups. With the latter solution, comparisons involving groups with larger sample sizes will have more power than those with smaller sample sizes.

If you have disparate variances, you should compute a q that is very similar to the separate variances t test earlier studied. The formula is

q = (M_i - M_j) / sqrt[(s_i²/n_i + s_j²/n_j) / 2] = t √2,

where t is the familiar separate variances t test. This procedure is known as the Games and Howell procedure. When using this unpooled-variances q, one should also adjust the degrees of freedom downwards, exactly as done with Satterthwaite's solution previously discussed.

Consult our textbook for details on how the SNK can have a familywise alpha that is greatly inflated if the omnibus null hypothesis is only partly true.

Relationship Between q and Other Test Statistics

The studentized range statistic is closely related to t and to F. If one computes the pooled-across-k-groups t, as done with Fisher's LSD, then q = t √2. If one computes an F from a planned comparison, then q = √(2F). For example, for the A vs. C comparison with our sample problem:

t = (7 - 2) / sqrt(2(.5)/5) = 5 / .447 = 11.18, and q = 11.18 √2 = 15.81.

The contrast coefficients to compare A with C would be -.5, 0, .5, 0. The contrast SS = n(Σ a_j M_j)² / Σ a_j² = 5[-.5(2) + 0(3) + .5(7) + 0(8)]² / (.25 + 0 + .25 + 0) = 5(6.25)/.5 = 62.5. F = 62.5/.5 = 125, and q = √(2 × 125) = 15.81.

Tukey's (a) Honestly Significant Difference Test

This test is applied in exactly the same way that the Student-Newman-Keuls is, with the exception that r is set at k for all comparisons. This test is more conservative (less powerful) than the Student-Newman-Keuls.

Tukey's (b) Wholly Significant Difference Test

This test is a compromise between the Tukey (a) and the Newman-Keuls. For each comparison, the critical q is set at the mean of the critical q were a Tukey (a) being done and the critical q were a Newman-Keuls being done.

Ryan's Procedure (REGWQ)

This procedure, the Ryan / Einot and Gabriel / Welsch procedure, is based on the q statistic, but adjusts the per comparison alpha in such a way (Howell provides details in our textbook) that the familywise error rate is maintained at the specified value (unlike with the SNK), while power will be greater than with the Tukey (a). I recommend its use with four or more groups. With three groups the REGWQ is identical to the SNK, and, as you know, I recommend Fisher's procedure when you have three groups. With four or more groups I recommend the REGWQ, but you can't do it by hand; you need a computer (SAS and SPSS will do it).
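Another computer-based option, outside of SAS and SPSS, is Python's statsmodels, which implements the Tukey (a) HSD procedure (not REGWQ or the SNK). Here is a minimal sketch, mine rather than the handout's, applied to the twenty scores from the four-group example; it assumes statsmodels is installed.

    # Tukey HSD on the four-group example data -- illustrative sketch
    import numpy as np
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    scores = np.array([1, 2, 2, 2, 3,    # Group A
                       2, 3, 3, 3, 4,    # Group B
                       6, 7, 7, 7, 8,    # Group C
                       7, 8, 8, 8, 9])   # Group D
    groups = np.repeat(["A", "B", "C", "D"], 5)

    # Hold familywise error at a maximum of .01, as in the write-up below
    result = pairwise_tukeyhsd(scores, groups, alpha=0.01)
    print(result)   # A-B and C-D should not be rejected; every A/B versus C/D comparison should be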
Other Procedures

Dunn's Test (The Bonferroni t)

Since the familywise error rate is always less than or equal to the error rate per experiment, α_fw ≤ c · α_pc (an inequality known as the Bonferroni inequality), one can be sure that alpha familywise does not exceed some desired maximum value by using an adjusted alpha per comparison that equals the desired maximum alpha familywise divided by c, that is, α_pc = α_fw / c. In other words, you compute a t or an F for each desired comparison, usually using an error term pooled across all k groups, obtain an exact p from the test statistic, and then compare that p to the adjusted alpha per comparison. Alternatively, you could multiply the p by c and compare the resulting ensemble-adjusted p to the maximum acceptable familywise error rate. Dunn has provided special tables which give critical values of t for adjusted per comparison alphas, in case you cannot obtain the exact p for a comparison. For example, t(120) = 2.60, alpha familywise = .05, c = 6. Is this t significant? You might have trouble finding a critical value of t at per comparison alpha = .05/6 in a standard t table.

Dunn-Šidák Test

Šidák has demonstrated that the familywise error rate for c nonorthogonal (nonindependent) comparisons is less than or equal to the familywise error rate for c orthogonal comparisons, that is, α_fw ≤ 1 - (1 - α_pc)^c. Thus, one can adjust the per comparison alpha by using an adjusted criterion for rejecting the null: reject the null only if p ≤ 1 - (1 - α_fw)^(1/c). This procedure, the Dunn-Šidák test, is more powerful than the Dunn test, especially when c is large.

Scheffé Test

To conduct this test one first obtains F ratios testing the desired comparisons. The critical value of the F is then adjusted (upwards). The adjusted critical F equals (the critical value for the treatment effect from the omnibus ANOVA) times (the treatment degrees of freedom from the omnibus ANOVA). This test is extremely conservative (low power) and is not generally recommended for making pairwise comparisons. It assumes that you will make all possible linear contrasts, not just simple pairwise comparisons. It is considered appropriate for making complex comparisons (such as groups A and B versus C, D, & E; A, B, & C vs. D & E; etc.).

Dunnett's Test

The Dunnett t is computed exactly as is the Dunn t, but a different table of critical values is employed. It is employed when the only comparisons being made are each treatment group with one control group. It is somewhat more powerful than the Dunn test for such an application.

Presenting the Results of Pairwise Contrasts

I recommend using a table like that below. You should also give a brief description of the pattern of significant differences between means, but do not mention each contrast that was made in a large set of pairwise contrasts.

Teaching method significantly affected test scores, F(3, 16) = 86.66, MSE = 0.50, p < .001, η² = .94, 95% CI [.82, .94]. Pairwise comparisons were made with Tukey's HSD procedure, holding familywise error at a maximum of .01. As shown in Table 1, the computer intensive and discussion centered methods were associated with significantly better student performance than that shown by students taught with the actuarial and book only methods. All other comparisons fell short of statistical significance.

Table 1
Mean Quiz Performance by Students Taught With Different Methods

Method of Instruction     Mean
Actuarial                 2.00 A
Book Only                 3.00 A
Computer Intensive        7.00 B
Discussion Centered       8.00 B

Note. Means sharing a letter in their superscript are not significantly different at the .01 level according to a Tukey HSD test.
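As a quick illustration of the Dunn and Dunn-Šidák adjustments described above, here is a short Python sketch of my own (not part of the handout, assuming SciPy is installed) for the example of c = 6 comparisons with a maximum familywise alpha of .05; it also answers the t(120) = 2.60 question by computing the exact p.

    # Bonferroni (Dunn) and Dunn-Sidak adjusted per-comparison alphas -- illustrative sketch
    from scipy import stats

    alpha_fw, c = .05, 6
    alpha_bonf  = alpha_fw / c                      # = .00833
    alpha_sidak = 1 - (1 - alpha_fw) ** (1 / c)     # = .00851 (slightly more liberal, hence more power)
    print(round(alpha_bonf, 5), round(alpha_sidak, 5))

    # Is t(120) = 2.60 significant with the Dunn adjustment?
    p = 2 * stats.t.sf(2.60, 120)                   # exact two-tailed p
    print(round(p, 5), p <= alpha_bonf)

With the Dunn adjustment the exact two-tailed p (about .010) exceeds .05/6 = .0083, so that t would not be declared significant.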
Familywise Error, Alpha-Adjustment, and the Boogey Men

One should keep in mind that the procedures discussed above can very greatly lower power. Unless one can justify taking a much greater risk of a Type II error as the price to pay for keeping the conditional probability of a Type I error unreasonably low, I think it not good practice to employ these techniques. So why do I teach them? Well, because others will expect you (and me) to use these techniques even if we think them unwise. Please read my rant about this: Familywise Alpha and the Boogey Men.

Can I Make These Comparisons Even If the ANOVA Is Not Significant?

Yes, with the exception of Fisher's procedure. The other procedures were developed to be used instead of ANOVA, not following a significant ANOVA. You don't even need to do the ANOVA, and if you do, it does not need to be significant for you to be permitted to make multiple comparisons. There is much misunderstanding about this. Please read Pairwise Comparisons.

fMRI Gets Slap in the Face with a Dead Fish: an example of research where so many comparisons are made that spurious effects are very likely to be found.

Return to Karl's Stats Lessons Page

Copyright 2010, Karl L. Wuensch - All rights reserved.

G*Power: One-Way Independent Samples ANOVA

See the power analysis done by hand in my document One-Way Independent Samples Analysis of Variance. Here I shall do it with G*Power. We want to know how much power we would have for a three-group ANOVA where we have 11 cases in each group and the effect size in the population is f = .163. When we did it by hand, using the table in our text book, we found power = 10%.

Boot up G*Power. Click OK. Click OK again on the next window. Click Tests, F-Test (Anova). Under Analysis, select Post Hoc. Enter .163 as the Effect size f, .05 as the Alpha, 33 as the Total sample size, and 3 as the number of Groups. Click Calculate. G*Power tells you that power = .1146.

OK, how many subjects would you need to raise power to 70%? Under Analysis, select A Priori, under Power enter .70, and click Calculate. G*Power advises that you need 294 cases, evenly split into three groups, that is, 98 cases per group. Alt-X, Discard to exit G*Power. That was easy, wasn't it?

Links
Karl Wuensch's Statistics Lessons
Internet Resources for Power Analysis

Karl L. Wuensch, Dept. of Psychology, East Carolina University, Greenville, NC USA
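For readers who prefer a scriptable alternative to G*Power, the same two calculations can be approximated with Python's statsmodels, whose FTestAnovaPower class also takes Cohen's f as its effect size. This is my own sketch, not part of the handout, and its results should agree closely (though not necessarily exactly) with the G*Power values above.

    # ANOVA power analysis with Cohen's f -- illustrative sketch
    from statsmodels.stats.power import FTestAnovaPower

    analysis = FTestAnovaPower()

    # Post hoc: power for f = .163, alpha = .05, N = 33 total, k = 3 groups (G*Power gave .1146)
    power = analysis.power(effect_size=0.163, nobs=33, alpha=0.05, k_groups=3)
    print(round(power, 4))

    # A priori: total N needed for 70% power (G*Power gave 294, i.e., 98 per group)
    nobs = analysis.solve_power(effect_size=0.163, power=0.70, alpha=0.05, k_groups=3)
    print(nobs)   # round up and split evenly across the three groups

Small discrepancies from G*Power can arise from rounding and from minor differences in how the noncentrality parameter is handled.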
PSY-ANOVA1.doc

PSY: A Program for Contrast Analysis

At http://www.psy.unsw.edu.au/research/PSY.htm one can obtain PSY: A Program for Contrast Analysis, by Kevin Bird, Dusan Hadzi-Pavlovic, and Andrew Isaac. This program computes unstandardized and approximate standardized confidence intervals for contrasts with between-subjects and/or within-subjects factors. It will also compute simultaneous confidence intervals. Contrast coefficients are provided as integers, and the program converts them to standard weights.

Here is a properly formatted input data set for the data in my handout One-Way Independent Samples Analysis of Variance. Contrast coefficients are provided to compare combined groups A and B with combined groups C and D, A with B, and C with D, which happens to comprise a complete set of orthogonal contrasts.

1 1
1 2
1 2
1 2
1 3
2 2
2 3
2 3
2 3
2 4
3 6
3 7
3 7
3 7
3 8
4 7
4 8
4 8
4 8
4 9
[BetweenContrasts]
-1 -1 1 1
-1 1 0 0
0 0 -1 1

Here is the output from PSY:

PSY
Date: 12/23/2005   Time: 3:46:53 PM
File: C:\Documents and Settings\Karl Wuensch\My Documents\Temp\psy\ANOVA1.in

Number of Groups: 4
Number of Measurements: 1
Number of subjects in...
  Group 1: 5
  Group 2: 5
  Group 3: 5
  Group 4: 5

Between contrast coefficients
Contrast   Group...   1    2    3    4
B1                   -1   -1    1    1
B2                   -1    1    0    0
B3                    0    0   -1    1

Means and Standard Deviations
Group 1   Overall Mean: 2.000   Measurement 1: Mean 2.000, SD 0.707
Group 2   Overall Mean: 3.000   Measurement 1: Mean 3.000, SD 0.707
Group 3   Overall Mean: 7.000   Measurement 1: Mean 7.000, SD 0.707
Group 4   Overall Mean: 8.000   Measurement 1: Mean 8.000, SD 0.707
Means and SDs averaged across groups   Measurement 1: Mean 5.000, SD 0.707

Analysis of Variance Summary Table
Source     SS        df    MS        F
B1         125.000    1    125.000   250.000
B2           2.500    1      2.500     5.000
B3           2.500    1      2.500     5.000
Error        8.000   16      0.500

Individual 95% Confidence Intervals
The CIs refer to mean difference contrasts, with coefficients rescaled if necessary. The rescaled contrast coefficients are:

Rescaled Between contrast coefficients
Contrast   Group...   1        2        3        4
B1                   -0.500   -0.500    0.500    0.500
B2                   -1.000    1.000    0.000    0.000
B3                    0.000    0.000   -1.000    1.000

Raw CIs (scaled in Dependent Variable units)
Contrast   Value    SE       Lower    Upper
B1         5.000    0.316    4.330    5.670
B2         1.000    0.447    0.052    1.948
B3         1.000    0.447    0.052    1.948

Approximate Standardized CIs (scaled in Sample SD units)
Contrast   Value    SE       Lower    Upper
B1         7.071    0.447    6.123    8.019
B2         1.414    0.632    0.073    2.755
B3         1.414    0.632    0.073    2.755

Return to Karl's Stats Lessons Page

Karl L. Wuensch, Dept. of Psychology, East Carolina University, Greenville, NC USA
December, 2005

Strength_of_Effect.doc

Reporting the Strength of Effect Estimates for Simple Statistical Analyses

This document was prepared as a guide for my students in Experimental Psychology. It shows how to present the results of a few simple but common statistical analyses. It also shows how to compute commonly employed strength of effect estimates.

Independent Samples T

When we learned how to do t tests (see T Tests and Related Statistics: SPSS), you compared the mean amount of weight lost by participants who completed two different weight loss programs. Here is SPSS output from that analysis:

Group Statistics (LOSS)
GROUP    N    Mean    Std. Deviation    Std. Error Mean
1         6   22.67   4.274             1.745
2        12   13.25   4.093             1.181

The difference in the two means is statistically significant, but how large is it? We can express the difference in terms of within-group standard deviations, that is, we can compute the statistic commonly referred to as Cohen's d, but more appropriately referred to as Hedges' g. Cohen's d is a parameter.
Hedges' g is the statistic we use to estimate d. First we need to compute the pooled standard deviation. Convert the standard deviations to sums of squares by squaring each and then multiplying by (n - 1). For Group 1, (5)(4.274²) = 91.34. For Group 2, (11)(4.093²) = 184.28. Now compute the pooled standard deviation this way:

s_pooled = sqrt[(SS₁ + SS₂) / (n₁ + n₂ - 2)] = sqrt[(91.34 + 184.28) / 16] = 4.15.

Finally, simply standardize the difference in means:

g = (M₁ - M₂) / s_pooled = (22.67 - 13.25) / 4.15 = 2.27,

a very large effect.

An easier way to get the pooled standard deviation is to conduct an ANOVA relating the test variable to the grouping variable. Here is SPSS output from such an analysis:

ANOVA (LOSS)
Source            Sum of Squares    df    Mean Square    F         Sig.
Between Groups    354.694            1    354.694        20.593    .000
Within Groups     275.583           16     17.224
Total             630.278           17

Now you simply take the square root of the within groups mean square. That is, sqrt(17.224) = 4.15 = the pooled standard deviation. An easier way to get the value of g is to use one of my programs for placing a confidence interval around our estimate of d. See my document Confidence Intervals, Pooled and Separate Variances T.

Here is an APA-style summary of the results:

Persons who completed weight loss program 1 lost significantly more weight (M = 22.67, SD = 4.27, n = 6) than did those who completed weight loss program 2 (M = 13.25, SD = 4.09, n = 12), t(9.71) = 4.47, p = .001, g = 2.27.

Do note that I used the separate variances t here, because I had both unequal sample sizes and disparate sample variances. Also note that I reported the sample sizes, which are not obvious from the df when reporting a separate variances test. You should also recall that the difference in sample sizes here was cause for concern (indicating a problem with selective attrition).

One alternative strength of effect estimate that can be used here is the squared point-biserial correlation coefficient, which will tell you what proportion of the variance in the test variable is explained by the grouping variable. One way to get that statistic is to take the pooled t and substitute it in this formula:

r²_pb = t² / (t² + n₁ + n₂ - 2) = 4.538² / (4.538² + 6 + 12 - 2) = .56.

An easier way to get that statistic is to compute the r between the test scores and the numbers you used to code group membership. SPSS gave me this:

Correlations (GROUP with LOSS)
Pearson Correlation    -.750
N                        18

When I square -.75, I get .56. Another way to get this statistic is to do a one-way ANOVA relating groups to the test variable. See the output from the ANOVA above. The eta-squared statistic is

η² = SS_between / SS_total = 354.694 / 630.278 = .56.

Please note that η² is the same as the squared point-biserial correlation coefficient (when you have only two groups). When you use SAS to do an ANOVA, you are given the η² statistic with the standard output (SAS calls it R²). Here is an APA-style summary of the results with eta-squared:

Persons who completed weight loss program 1 lost significantly more weight (M = 22.67, SD = 4.27, n = 6) than did those who completed weight loss program 2 (M = 13.25, SD = 4.09, n = 12), t(9.71) = 4.47, p = .001, η² = .56.

One-Way Independent Samples ANOVA

The most commonly employed strength of effect estimates here are η² and ω² (consult your statistics text or my online notes on ANOVA to see how to compute ω²). I have shown above how to compute η² as a ratio of the treatment SS to the total SS.
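The arithmetic above is easy to script. Here is a small Python sketch of my own, using only the numbers reported in the SPSS output above, that reproduces the pooled standard deviation, Hedges' g, the squared point-biserial correlation, and eta-squared for the weight-loss example.

    # Strength-of-effect estimates for the two-group weight-loss example -- illustrative sketch
    from math import sqrt

    n1, m1, sd1 = 6, 22.67, 4.274
    n2, m2, sd2 = 12, 13.25, 4.093

    ss1, ss2 = (n1 - 1) * sd1**2, (n2 - 1) * sd2**2      # 91.34 and 184.28
    s_pooled = sqrt((ss1 + ss2) / (n1 + n2 - 2))         # 4.15
    g = (m1 - m2) / s_pooled                             # Hedges' g = 2.27
    print(round(s_pooled, 2), round(g, 2))

    t_pooled = 4.538                                     # pooled-variances t reported in the handout
    r2_pb = t_pooled**2 / (t_pooled**2 + n1 + n2 - 2)    # squared point-biserial r = .56
    eta2 = 354.694 / 630.278                             # SS_between / SS_total = .56
    print(round(r2_pb, 2), round(eta2, 2))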
If you have done a trend analysis (polynomial contrasts), you should report not only the overall treatment η² but also η² for each trend (linear, quadratic, etc.). Consult the document One-Way Independent Samples ANOVA with SPSS for an example summary statement. Don't forget to provide a table with group means and standard deviations. If you have made comparisons between pairs of means, it is a good idea to present d or η² for each such comparison, although that is not commonly done. Look back at the document One-Way Independent Samples ANOVA with SPSS and see how I used a table to summarize the results of pairwise comparisons among means. One should also try to explain the pattern of pairwise results in text, like this (for a different experiment):

The REGWQ procedure was used to conduct pairwise comparisons holding familywise error at a maximum of .05 (see Table 2). The elevation in pulse rate when imagining infidelity was significantly greater for men than for women. Among men, the elevation in pulse rate when imagining sexual infidelity was significantly greater than when imagining emotional infidelity. All other pairwise comparisons fell short of statistical significance.

Correlation/Regression Analysis

You will certainly have reported r or r², and that is sufficient as regards strength of effect. Here is an example of how to report the results of a regression analysis, using the animal rights and misanthropy analysis from the document Correlation and Regression Analysis: SPSS:

Support for animal rights (M = 2.38, SD = 0.54) was significantly correlated with misanthropy (M = 2.32, SD = 0.67), r = .22, Animal Rights = 1.97 + .175 Misanthropy, n = 154, p = .006.

Contingency Table Analysis

Phi, Cramér's phi (also known as Cramér's V), and odds ratios are appropriate for estimating the strength of effect between categorical variables. Please consult the document Two-Dimensional Contingency Table Analysis with SPSS. For the analysis done there relating physical attractiveness of the plaintiff to the verdict recommended by the juror, we could report:

Guilty verdicts were significantly more likely when the plaintiff was physically attractive (76.7%) than when she was physically unattractive (54.2%), χ²(1, N = 145) = 6.23, p = .013, φ_C = .21, odds ratio = 2.8.

Usually I would not report both φ_C and an odds ratio.

Cramér's phi is especially useful when the effect has more than one df. For example, for the Weight x Device crosstabulation discussed in the document Two-Dimensional Contingency Table Analysis with SPSS, we cannot give a single odds ratio that captures the strength of the association between a person's weight and the device (stairs or escalator) that person chooses to use, but we can use Cramér's phi. If we make pairwise comparisons (a good idea), we can employ odds ratios with them. Here is an example of how to write up the results of the Weight x Device analysis:

As shown in Table 1, shoppers' choice of device was significantly affected by their weight, χ²(2, N = 3,217) = 11.75, p = .003, φ_C = .06. Pairwise comparisons between groups showed that persons of normal weight were significantly more likely to use the stairs than were obese persons, χ²(1, N = 2,142) = 9.06, p = .003, odds ratio = 1.94, as were overweight persons, χ²(1, N = 1,385) = 11.82, p = .001, odds ratio = 2.16, but the difference between overweight persons and normal weight persons fell short of statistical significance, χ²(1, N = 2,907) = 1.03, p = .31, odds ratio = 1.12.

Table 1
Percentage of Persons Using the Stairs by Weight Category

Category      Percentage
Obese          7.7
Overweight    15.3
Normal        14.0
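Here is a brief Python sketch of how such statistics can be computed (my own addition, assuming NumPy and SciPy are installed). The 2 x 2 counts below are hypothetical, since the handout reports only percentages and test statistics rather than the raw frequencies, so the printed values will not reproduce the numbers in the write-ups above; the sketch only illustrates the computations.

    # Chi-square, Cramer's phi, and odds ratio for a 2 x 2 table -- illustrative sketch
    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical counts: rows = attractive / unattractive, columns = guilty / not guilty
    table = np.array([[45, 15],
                      [30, 30]])

    chi2, p, dof, expected = chi2_contingency(table, correction=False)

    n = table.sum()
    cramer_phi = np.sqrt(chi2 / (n * (min(table.shape) - 1)))   # for a 2 x 2 table this is simply phi
    odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])

    print(round(chi2, 2), round(p, 3), round(cramer_phi, 2), round(odds_ratio, 2))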
Of course, a figure may look better than a table. For example:

Figure 1. Percentage Use of Stairs (a bar graph of the Table 1 percentages by weight category).

For an example of a plot used to illustrate a three-dimensional contingency table, see the document Two-Dimensional Contingency Table Analysis with SPSS.

Copyright 2005, Karl L. Wuensch - All rights reserved.

T Tests, ANOVA, and Regression Analysis

Here is a one-sample t test of the null hypothesis that mu = 0:

DATA ONESAMPLE; INPUT Y @@; CARDS;
1 2 3 4 5 6 7 8 9 10
PROC MEANS T PRT; RUN;

------------------------------------------------------------------------------------------------
The SAS System
The MEANS Procedure
Analysis Variable : Y
    t Value    Pr > |t|
       5.74      0.0003
------------------------------------------------------------------------------------------------

Now an ANOVA on the same data but with no grouping variable:

PROC ANOVA; MODEL Y = ; run;

------------------------------------------------------------------------------------------------
The SAS System
The ANOVA Procedure
Dependent Variable: Y

Source               DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                 1       302.5000000    302.5000000      33.00    0.0003
Error                 9        82.5000000      9.1666667
Uncorrected Total    10       385.0000000

R-Square    Coeff Var    Root MSE    Y Mean
0.000000     55.04819    3.027650    5.500000

Source        DF    Anova SS       Mean Square    F Value    Pr > F
Intercept      1    302.5000000    302.5000000      33.00    0.0003
------------------------------------------------------------------------------------------------

Notice that the ANOVA F is simply the square of the one-sample t, and the one-tailed p from the ANOVA is identical to the two-tailed p from the t.

Now a regression analysis with the model Y = intercept + error:

PROC REG; MODEL Y = ; run;
------------------------------------------------------------------------------------------------