TRANSCRIPT
8/6/2019 Beginners Guide to Statistics
A Beginner's Guide to Statistical Testing
Denton Bramwell, Nov 15, 1999
Promontory Management Group Inc., [email protected]
Statistics was invented as an occupation in order to provide employment for those who do
not have enough personality to go into accounting. I guess that explains why I took up
Six Sigma, with its concomitant involvement in statistics.
In this short paper, I'll explain in the simplest way I can the practical fundamentals of a few key statistical tests. The statistical software that I use is Minitab version 12. [Note: Version 13 has since been released.]
Data Types
There are four types of data. It is important to know which kind of data you are dealing
with, since this determines the types of test you can do. The four types are
Nominal or categorical: Classes things into categories, such as pass/fail, old/new, red/blue/green, or died/survived.

Ordinal: Ranks things. Team B is stronger than C, which is, in turn, stronger than A.

Interval: Things you can add and subtract, like degrees Celsius or Fahrenheit. Interval data can be continuous or discrete. Continuous data can take on any value, such as the actual air pressure inside a tire. Discrete data comes in steps, like money. In the US, the smallest increment of money is one cent, so expressions of money normally come in steps of one cent.

Ratio: Similar to interval data, but zero indicates a total absence of a property. Temperature expressed in kelvins is ratio data. Like interval data, ratio data may be continuous or discrete.
Nominal data yields the least information. Ordinal data is better. Interval or ratio data is best.
Suppose you have been told that you will be taking a long trip, to 10 different destinations, somewhere in the world. If you are told only that five are in the northern hemisphere, and that five are in the southern hemisphere, you have no idea what kind of clothes to pack. That is because you have been given categorical data, and categorical data conveys the least information of any of the data types. On the other hand, if you are given the exact coordinates of your 10 destinations, you will know exactly what to pack. Expressing positions on the Earth's surface requires ratio data, and ratio data conveys the most information. If possible, use interval or ratio data in conducting your investigations.
Basics of Hypothesis Testing
The null hypothesis, Ho, is the dull hypothesis: Nothing interesting is going on. For
example, it makes no difference whether people receive an antibiotic or a placebo.
The Chi Square Test
This is a test using nominal data. It makes very few assumptions, so it can be used where other tests are useless. The bad news is that it can require a lot of data, and the results are sometimes hard to interpret.
This test is performed on data organized into rows and columns. For example, we might want to test survival of patients who are given one of three types of care in a life-threatening situation. The data array might look like
Treatment 1 Treatment 2 Treatment 3
Survived 21 40 19
Died 9 2 11
The null hypothesis of the Chi Square test is that the rows and columns are statistically independent. That is, if I know which column a case is in, I cannot make much better than a random guess as to which row it is from, or vice versa.
Let's now go to Minitab and perform the test. Enter the data table, and choose STATS|TABLES|CHI SQUARE. You will obtain this report:
1 21 40 19 80
23.53 32.94 23.53
2 9 2 11 22
6.47 9.06 6.47
Total 30 42 30 102
Chi-Sq = 0.272 + 1.513 + 0.872 +
0.989 + 5.500 + 3.171 = 12.316
DF = 2, P-Value = 0.002
You can see that our original input is printed, with row and column sums. Beneath each of the original entries is an expected value for each cell. These are the numbers 23.53, 32.94, and so on. If any of these values is below 5, it means that you have not collected enough data to run this test, and cannot depend on the result. In this case, our lowest cell has an expected value of 6.47, so we are good to go.
The interpretation of this test lies in the P value, which is our risk, or probability of being wrong if we assert there is a difference. It says that we would run a .2% chance of being wrong if we asserted that the treatments produced different survival rates. That's a strong outcome. The weakness of the test is that all you really statistically know is that they are not the same. You don't officially know which treatment is better or worse.
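If you want to check Minitab's arithmetic, the statistic can be reproduced by hand. This is a sketch in Python (my choice of language, not part of the original paper); it relies on the fact that for 2 degrees of freedom the chi-square survival function reduces to exp(-x/2):

```python
# Chi Square test on the treatment/survival table, standard library only.
import math

observed = [[21, 40, 19],   # survived
            [9,  2,  11]]   # died

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n   # e.g. 23.53, 32.94, ...
        chi_sq += (o - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (2-1)*(3-1) = 2
p_value = math.exp(-chi_sq / 2)                     # exact only for df = 2

print(f"Chi-Sq = {chi_sq:.3f}, DF = {df}, P-Value = {p_value:.3f}")
```

This reproduces the report's Chi-Sq = 12.316, DF = 2, P-Value = 0.002.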
The Tukey Tail Count
The B vs. C, or Tukey Tail Test, is extremely easy to use, and reasonably powerful. It is nonparametric. That means that it does not depend on the data being normally distributed, so you can use this test when you have really ugly data, and still get beautiful results. Another good feature of the B vs. C test is that strong statistical indications can be found with very small amounts of data. This test uses ordinal data.
C represents one process that we want to evaluate, usually the current process that we want to replace. B represents the other process, usually the better process that we want to put in place. John Tukey is the statistician who developed and popularized the test that B vs. C is based on. The data analysis is based on counting Bs and Cs on the ends, or tails, of a distribution. That's where the name Tukey Tail Test comes from.
This test absolutely requires randomization and/or blocking. It is tempting to just run a few Cs, then switch over and run a few Bs, and make a decision. This is an invitation to error. You must either block or randomize such variables as age, gender, or other factors that might influence the outcome.
Suppose that you randomly select just 3 items from your C process, and 3 items from your B process, and that all 3 of your Bs were better than your Cs. Intuitively, you'd probably feel that you were on to something, and that your B process was indeed better. Statistics support your intuition.
There are just 20 different orders you can put 3 Bs and 3 Cs into (try it!). Only one of those is 3 Bs above 3 Cs. You have just one chance in 20, or .05 probability, of arriving at such an arrangement by chance. So if all 3 Bs are better than all 3 Cs, you can be 95% confident that B is indeed better than C, and you have arrived at this with only six samples.
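The counting argument is easy to verify by brute force. A short Python sketch (my addition; any language would do):

```python
# Enumerate the distinct orderings of 3 Bs and 3 Cs, and confirm that
# exactly one of the 20 puts all three Bs on top.
from itertools import permutations
from math import comb

orderings = set(permutations("BBBCCC"))
assert len(orderings) == comb(6, 3)          # 6!/(3!3!) = 20 arrangements

all_bs_on_top = [o for o in orderings if o[:3] == ("B", "B", "B")]
alpha = len(all_bs_on_top) / len(orderings)  # 1/20

print(len(orderings), alpha)   # 20 0.05
```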
One other nicety is that this works for cosmetic issues. You don't really have to be able to measure goodness. You just have to be able to rank your Bs and Cs from highest to lowest. So if you want to test the attractiveness of men/women in the fashion modeling business vs. the attractiveness of men/women in med school, this is your test, assuming you can put all of them in rank order of attractiveness.
The following table shows the number of Bs and Cs you need, for a given level of alpha risk. Note that for each level of alpha, there are several acceptable combinations of B and C sample sizes. Remember, you must randomize your selections. Also remember, all your Bs must outrank all your Cs. Note that the table includes our 3 B vs. 3 C example.
Risk    Number of B samples    Number of C samples
.001    2                      43
        3                      16
        4                      10
        5                      8
        6                      6
.01     2                      13
        3                      7
        4                      5
        5                      4
.05     1                      19
        2                      5
        3                      3
        4                      3
.1      1                      9
        2                      3
        3                      2
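Each table entry can be checked against the underlying probability: when all b Bs outrank all c Cs, the chance of that arrangement is 1/C(b+c, b). A Python sketch of the check (the table values are from the paper; the code is mine):

```python
# Verify the B vs. C table: each (B, C) pair should give a chance
# probability at or near its stated risk level.
from math import comb

table = {
    0.001: [(2, 43), (3, 16), (4, 10), (5, 8), (6, 6)],
    0.01:  [(2, 13), (3, 7), (4, 5), (5, 4)],
    0.05:  [(1, 19), (2, 5), (3, 3), (4, 3)],
    0.1:   [(1, 9), (2, 3), (3, 2)],
}

for risk, pairs in table.items():
    for b, c in pairs:
        p = 1 / comb(b + c, b)
        print(f"{b} Bs vs {c} Cs: p = {p:.5f} (stated risk {risk})")
        assert p <= risk * 1.1   # small tolerance for rounded table entries
```

Our 3 B vs. 3 C example appears in the .05 row: 1/C(6, 3) = 1/20 = .05.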
There is another variation on this test that is useful, and in many cases preferred to the system already shown. In this system, all the Bs do not need to outrank all the Cs. It does require that the number of B samples and the number of C samples be approximately equal. If the ratio between the size of the B sample and the C sample falls in the range of 3:4 to 4:3, the sample sizes are near enough to being equal. This test also absolutely requires randomization.
Suppose that you draw a random sample of 6 Bs and 6 Cs, and that you rank them in order of goodness. Your distribution might look like this:

[Figure: the 12 samples listed from best to worst. The unbroken run of pure Bs at the top is your B end count; the unbroken run of pure Cs at the bottom is your C end count; the mixed Bs and Cs in the middle are disregarded.]
Now just add your B end count to your C end count, and refer to this chart for your level of significance.
Risk    B+C End Count, At Least
.1      6
.05     7
.01     10
.001    13
Since our total end count is 7, we can be 95% confident that B is better than C.
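A small helper makes the end-count rule concrete. The ranking below is hypothetical, constructed to have the same total end count of 7 as the example above:

```python
# Tukey end count for a ranked list (best first): the run of pure Bs
# at the top plus the run of pure Cs at the bottom.
def end_count(ranked):
    top = 0
    for x in ranked:
        if x != "B":
            break
        top += 1
    bottom = 0
    for x in reversed(ranked):
        if x != "C":
            break
        bottom += 1
    return top + bottom

# Hypothetical ranking of 6 Bs and 6 Cs, best to worst:
sample = list("BBBBCCBCBCCC")
print(end_count(sample))   # 4 Bs on top + 3 Cs on the bottom = 7
```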
The Two Sample T Test
This is a powerful test, using interval or ratio data. Its null hypothesis is that the means of two normally distributed populations are the same. The alternate hypothesis can be that they are not equal, or that the mean of A is greater/less than the mean of B.
This test requires that two fundamental assumptions be met:
The data are normally distributed. A normal distribution is also called a Gaussian distribution. Actually, the test is pretty robust, and will still give good results with pretty non-normal data.
The data are stable during the sampling period. This assumption is very important.
You must demonstrate that you meet these assumptions if you want to cruise down the
Gaussian superhighway. Fortunately, this is easy.
[Figure: I Chart for DATA1, individual values plotted by observation number. X=100.1, 3.0SL=117.6, -3.0SL=82.61.]
First, let's check stability for DATA1. There are several ways to do this, but my favorite is the Individuals Control Chart. This is run by clicking STATS|CONTROL
CHARTS|INDIVIDUALS. The basic thing we're doing is verifying that there are no trends or sudden, major shifts in the data (stability). This chart spreads the data out in time, and allows us to inspect it. DATA1 is just fine.
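Minitab draws the chart for you, but the control limits are simple to compute yourself: the usual individuals-chart limits sit at the mean plus or minus 2.66 times the average moving range (2.66 = 3/d2, with d2 = 1.128 for a moving range of span 2). A Python sketch with hypothetical data, since the DATA1 values are not reproduced in this paper:

```python
# Individuals (I) chart limits: center line at the mean, control
# limits at mean +/- 2.66 * average moving range. Hypothetical data.
data = [100, 103, 98, 101, 97, 104, 99, 102, 100, 96]

mean = sum(data) / len(data)
moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
mr_bar = sum(moving_ranges) / len(moving_ranges)

ucl = mean + 2.66 * mr_bar
lcl = mean - 2.66 * mr_bar
print(f"X = {mean:.2f}, 3.0SL = {ucl:.2f}, -3.0SL = {lcl:.2f}")
```

Points outside these limits, or obvious trends and shifts within them, are evidence against stability.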
DATA2 is not so fine. I see an upward trend in the first dozen data points, followed by a sudden downward shift, then a sudden upward shift at about data point 19. This data does not meet our assumption of stability.
[Figure: I Chart for DATA2. X=106.6, 3.0SL=129.0, -3.0SL=84.29.]

[Figure: I Chart for DATA3. X=0.6667, 3.0SL=25.36, -3.0SL=-24.03.]
DATA3 has a discrimination problem. The only values that the data can take on are -10, -5, 0, 5, and 10. Since the data can only exist in one of five states, it will not pass a normality test. I've researched the matter, and have not been able to get an accurate statement of how much discrimination is enough. My opinion is that you should see at least 10 different values in the data, and there is some reasoning behind this opinion, which we don't have space to discuss here.
We have now concluded that DATA2 and DATA3 are not prime candidates for the Gaussian/normal model. This does not mean that we absolutely cannot apply the model, but it does mean that if we do apply it, we use the results with appropriate caution.
DATA1 has met both of our tests so far, and requires only one more test. That is for normality. This test is done by clicking STAT|BASIC STATS|NORMALITY TEST. DATA1 produces this result:
Anderson-Darling Normality Test
A-Squared: 0.152
P-Value: 0.953
N: 24
Average: 100.082
StDev: 5.26480

[Figure: Normal Probability Plot of DATA1.]
The first test is to simply look at the data points and see if they fall close to the red line. How close is close enough? We generally use the rule that if it can be covered by a fat pencil, it is good enough. In fact, if you have fewer than 15-20 data points, this is the only test you need apply. P values are not too revealing for small samples. However, if you have many points, pay close attention to the P value. If it is .05 or more, you have no reason to believe the data is non-normal.
Another way to look for normality is to do STAT|BASIC STATS|DISPLAY DESCRIPTIVE STATISTICS, and under GRAPHS, check GRAPHICAL SUMMARY. That will produce this result:
[Figure: Descriptive Statistics graphical summary for DATA1. Anderson-Darling Normality Test: A-Squared 0.152, P-Value 0.953. Mean 100.082, StDev 5.265, Variance 27.7181, Skewness 0.341344, Kurtosis 9.98E-04, N 24. Minimum 90.854, 1st Quartile 96.666, Median 100.344, 3rd Quartile 103.928, Maximum 112.583. 95% Confidence Interval for Mu: 97.859 to 102.306. 95% Confidence Interval for Sigma: 4.092 to 7.385. 95% Confidence Interval for Median: 97.551 to 101.980.]
Note that we get our same normality P value, .953, and that the computer draws its best estimate of a normal curve that fits the data histogram. Don't be shocked if your data
looks a lot more ragged than this, but still tests normal. Small samples can look pretty Raggedy Andy, and still truly be normal.
Using the normal distribution superhighway has its distinct advantages. The toll is that you should give at least a little attention to checking the assumptions. In practice, you need not be overly concerned about normality if your samples are reasonably large. It is not necessary for the distribution of data to be normal. It is only necessary that the distribution of differences be normal, and, with decent size samples, that will be the case.
Let us assume that we have been collecting data on a weight loss product. Two randomly selected groups of volunteers are to be tested. Group A receives a placebo, and group B receives Factor L. Both groups are weighed at the beginning of the study, and 90 days later, and the change in weight is calculated. For each volunteer we have a number that represents weight gain. A gain of -8 would be a loss of 8 pounds. We have tested the data for normality and stability, and it is satisfactory. We select as our null hypothesis: Weight gain does not depend on whether the volunteer received the placebo or Factor L. Our alternate hypothesis is that the mean gain for the control group is greater than the mean gain for the test group.
The test is performed by invoking STATS|BASIC STATS|2-SAMPLE T.
At this point, we must confess that we have slipped a new idea into the mix. We have developed a lot of our ideas based on the Gaussian distribution. It turns out there is another distribution that is practically identical to the Gaussian distribution for samples larger than 30. However, this other distribution is more accurate than the Gaussian for smaller sample sizes. This is the Student's T distribution. We almost always use it, since it applies everywhere the Gaussian distribution does, and some places that it does not.
Assume that my control group is in the column CONTROL, and that the test group is in the column FACTORL. You would then choose STAT|BASIC STATS|2-SAMPLE T. From the dialog window choose SAMPLES IN DIFFERENT COLUMNS, and indicate CONTROL as the first column and FACTORL as the second column. As your ALTERNATIVE, indicate GREATER THAN. This may seem a little backwards, but we are hoping the weight gains in the first column are greater than the weight gains in the second column. Minitab is not specific about telling you how to do this, so you must know that it assumes the form COLUMN 1 [>, <, or not equal] COLUMN 2.
Two sample T for CONTROL vs FACTORL
          N   Mean  StDev  SE Mean
CONTROL  24   1.01   3.19     0.65
FACTORL  24  -2.61   4.36     0.89
95% CI for mu CONTROL - mu FACTORL: ( 1.40, 5.85)
T-Test mu CONTROL = mu FACTORL (vs >): T = 3.29 P = 0.0010 DF = 42
[Figure: Boxplots of CONTROL and FACTORL (means are indicated by solid circles).]
The interpretation of this is that our control group of 24 volunteers gained an average of 1.01 pounds, and that the standard deviation of the group's gain was 3.19 pounds. Our test volunteers gained an average of -2.61 pounds (a 2.61 pound loss), and the standard deviation of the group's gain was 4.36 pounds. If we assert that the gain of the control group is greater than the gain of the test group, there is a .0010 chance that we will be wrong. We therefore reject the null hypothesis, and accept the alternative hypothesis. The box plot gives a nice visual to demonstrate what the statistics tell us.
It is by no means necessary to have equal sample sizes.
We did not check the ASSUME EQUAL VARIANCES box, because we have not taken the step of demonstrating that the standard deviations (and hence the variances) of the two groups are approximately equal. If you want to run TEST OF EQUAL VARIANCES on your data, you can earn the right to check this box. However, doing this essentially relieves the computer of some of its calculation duties, at some expense of your time, with negligible difference in results. If you are feeling compassionate toward your overworked computer, you may want to take this route. I suggest just leaving the box unchecked.
The default alternative hypothesis is NOT EQUAL. We chose to do a one sided test, GREATER THAN, because it always gives the test more power. If we had chosen NOT EQUAL, our P value would have been .0020. In this case, we would still make the same decision, but, still, we have doubled our P value. Frequently, this will be of great importance. If the test results are interesting only if they come out in one particular direction, then use the one sided test. That is the case most of the time.
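The T statistic in the report can be reproduced from the summary statistics alone. This sketch (mine, not the paper's) uses Welch's formula, matching the unchecked equal-variances box; the small difference from Minitab's T = 3.29 comes from the rounding of the printed means and standard deviations:

```python
# Welch's two-sample t from summary statistics (no equal-variance
# assumption). Values are the means/StDevs printed in the report.
import math

n1, m1, s1 = 24, 1.01, 3.19     # CONTROL
n2, m2, s2 = 24, -2.61, 4.36    # FACTORL

se = math.sqrt(s1**2 / n1 + s2**2 / n2)
t = (m1 - m2) / se              # about 3.28, vs. the report's 3.29

# Welch-Satterthwaite degrees of freedom
num = (s1**2 / n1 + s2**2 / n2) ** 2
den = (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
df = num / den                  # about 42, matching DF = 42

print(f"T = {t:.2f}, DF = {int(df)}")
```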
ANOVA
ANOVA stands for Analysis of Variance. It is very much like the 2 Sample T Test, but instead of having just two groups, you can have multiple groups. This is extremely handy. The null hypothesis of ANOVA is that all the groups have the same mean. Also, it assumes that all the groups have roughly the same standard deviation. Before running ANOVA, you should do Test of Equal Variances to ensure this. If Levene's test comes out with a P of .05 or more, you're generally good to go.
Giving ANOVA such a short discussion shouldn't be interpreted as indicating that it is less useful. It is an extremely useful tool, but beyond the scope of a short paper. If you understand the T Test, you should be able to move into simple ANOVA without much difficulty.
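For the flavor of what ANOVA computes, here is the F statistic by hand on three small hypothetical groups (stdlib Python; Minitab of course does all of this for you):

```python
# One-way ANOVA by hand: F is the between-group mean square over the
# within-group mean square. Hypothetical data.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

all_data = [x for g in groups for x in g]
grand_mean = sum(all_data) / len(all_data)

ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

df_between = len(groups) - 1               # 2
df_within = len(all_data) - len(groups)    # 6
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F = {f_stat:.2f}")
```

A large F (equivalently, a small P) says the group means differ by more than the within-group noise can explain.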
Power and Sample Size
DO NOT BEGIN AN EXPERIMENT THAT YOU ARE GOING TO ANALYZE WITH A T TEST OR A 2K FACTORIAL WITHOUT UNDERSTANDING THIS SECTION!
Risk, sample size, and the size of change you are trying to detect are three factors that are eternally at odds. It is rather like the old engineering maxim, "Good, fast, cheap. Pick any two." You can run a test for a small change, using a small sample size, but your risk of error will be high. You can run a test with low risk, trying to detect a small change, but your sample size will be large.
A powerful test is one that is not highly influenced by random factors. The mathematical definition of power is 1 - beta. It is the probability that you will detect a difference, if, in fact, it exists.
If you properly use this function of Minitab, you will be able to predict the approximate chances of success of your experiment. You would be amazed how many tests are run that have too few samples to have any reasonable chance of success. Conversely, you would also be amazed at how many tests are much more expensive than necessary, because they have far too many samples.
If someone wants you to run a test that you can demonstrate has a power of .5 or less, I suggest that you give them a quarter, and tell them to make their decision based on whether it comes up heads or tails. If your test is no better than a quarter, don't bother running it. Get on to something that does have a good chance of success.
Let's suppose that you know that normal for Factor K in the general population includes values from 90 to 140. Ben assures me that the medical standard for normal is 95% of the population. That's convenient, because 95% of the population occurs from 2 standard deviations below the mean to 2 standard deviations above the mean. We know, then, that 90 to 140 spans four standard deviations, and can quickly deduce that one standard deviation is about 12.5.
Now suppose that we have a new treatment that we think will beneficially increase Factor K by 15 points. Go to STAT|POWER AND SAMPLE SIZE. Choose 2-SAMPLE T, then CALCULATE SAMPLE SIZE FOR EACH POWER VALUE. In POWER VALUES, enter .95 .9 .87. In DIFFERENCE enter 15. In SIGMA, enter 12.5. Click OPTIONS, and choose GREATER THAN as your alternative hypothesis. Minitab will return this:
2-Sample t Test
Testing mean 1 = mean 2 (versus >)
Calculating power for mean 1 = mean 2 + 15
Alpha = 0.05 Sigma = 12.5
Sample Target Actual
Size Power Power
16 0.9500 0.9527
13 0.9000 0.9077
12 0.8700 0.8854
In this case, if we want a power of at least .95, we need two samples of 16 each. For a power of .9, we need only 13 in each sample, and if we are willing to settle for only .87, we need 12 in each sample.
Similarly, we can calculate our chances of success if sample size is dictated by budget or the ever innumerate pointy-haired boss.
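Minitab's figures can be sanity-checked with the normal approximation: n per group is roughly 2*((z_alpha + z_beta)*sigma/delta)^2. The exact t-based answer Minitab reports (16 at power .95) is slightly larger than this approximation. A stdlib Python sketch:

```python
# Normal-approximation sample size per group for a one-sided
# 2-sample t test. Minitab's exact t-based value is a bit larger.
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.95):
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # one-sided alpha
    z_beta = z.inv_cdf(power)
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

n = n_per_group(delta=15, sigma=12.5)
print(f"n per group (normal approximation): {n:.1f}")   # about 15
```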
2K Factorial Experiments
These are wonderful.
When you do a T Test, or a single-factor ANOVA, you are changing one variable, and studying the effect on your test subjects. This is a one factor at a time or OFAT experiment.
OFAT experiments are fine, but you have to realize that any factor that you don't account for shows up as noise, which makes your distributions fatter, and makes your effect harder to see. That is, cholesterol levels depend on a number of factors, such as diet, level of exercise, and heredity. In an OFAT, you randomize to make sure that these variables do indeed show up as noise. Would it not be better to make them part of the experiment? Then they can be made into signal instead of noise.
A 2K experiment allows you to do this. It also allows you to detect interactions between the variables. The lovely thing is that you can do this for four or five variables, and still not have to take many more observations.
For example, you can test two treatments simultaneously, and account for any differences due to gender, and you can test for all the interactions, all for the price of a T Test.
To set up your test, choose two to five variables you want to test. More than this can get a little messy. Choose a high and a low value for each variable. It is better if these are interval values, but they can be categorical/nominal. For example, if I want to test the effect of ascorbic acid and aspirin on acidity, I might choose zero as my low dose for both, and 500 mg as my high dose for both. I might want to test subject age as my other variable. I could pick 20-25 as the low value for my age variable and 55-60 as the high value.
My output will be some very scientifically arrived at acidity scale, which extends from 0 to 100.
For a simple demonstration, let's do this with a single replicate, and no center points. We'll add these embellishments in class.
In Minitab, go to STAT|DOE|CREATE FACTORIAL DESIGN. Choose 2 LEVEL FACTORIAL, with 3 as your number of variables. Under DESIGNS, choose FULL FACTORIAL, with 0 center points, 1 replicate, and 1 block. Click OK, then go to FACTORS, and in the NAME column, put ascorbic acid next to A, aspirin next to B, and age next to C. Your low value will automatically be coded as a -1, and your high value will automatically be coded as a 1. Yes, you can change this, but wait until you understand a little better.
Click OK, and OK again, and Minitab will design your experiment for you. It will automatically randomize your run order for you. (Actually, with a single replicate, randomization buys you nothing, so we would have been justified in not randomizing this time. However, we will shortly drop a variable, effectively creating a second replicate, and, at that point, randomization starts to have some use.) Your design will look different from mine, because your random number generator spat out a different sequence than mine did (well, we hope so, or they are not so random). Anyway, mine looks like this:
StdOrder  RunOrder  CenterPt  Blocks  ascorbic acid  aspirin  age  acidity
2         1         1         1        1             -1       -1
3         2         1         1       -1              1       -1
8         3         1         1        1              1        1
1         4         1         1       -1             -1       -1
4         5         1         1        1              1       -1
6         6         1         1        1             -1        1
7         7         1         1       -1              1        1
5         8         1         1       -1             -1        1
This instructs me to find a young subject, give him a high dose of ascorbic acid, a low
dose of aspirin, and note the resulting acidity. I then continue on down the list. One
missing data point is not necessarily the kiss of death when you have multiple replicates (sets of data), but you do want to try very hard to get complete sets.
In my table, I created a made-up set of results, and Minitab obligingly spits out the
following interesting output:
Fractional Factorial Fit
Estimated Effects and Coefficients for acidity (coded units)
Term Effect Coef
Constant 43.7500
aspirin 57.5000 28.7500
ascorbic 27.5000 13.7500
age 10.0000 5.0000
aspirin*ascorbic 7.5000 3.7500
aspirin*age 0.0000 0.0000
ascorbic*age -0.0000 -0.0000
aspirin*ascorbic*age 0.0000 0.0000

Analysis of Variance for acidity (coded units)

Source               DF   Seq SS   Adj SS   Adj MS   F   P
Main Effects          3  8325.00  8325.00  2775.00   *   *
2-Way Interactions    3   112.50   112.50    37.50   *   *
3-Way Interactions 1 0.00 0.00 0.00 * *
Residual Error 0 0.00 0.00 0.00
Total 7 8437.50
[Figure: Pareto Chart of the Effects (response is acidity, Alpha = .10). A: aspirin, B: ascorbic, C: age.]

The interpretation of this output is that going from the low state of aspirin to the high state accounts for 57.5 units of the observed change in acidity. Similarly, ascorbic acid is
responsible for 27.5 units of change. Age, and an interaction between aspirin and ascorbic acid, account for lesser amounts of change, and the model predicts that these last two are not statistically significant. We shall see more about this as we add replicates.

Since age is not indicated as a significant variable, we can gain more information about the important variables by eliminating this unimportant one.
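The effect estimates themselves are nothing exotic: each effect is simply the mean response at the factor's high setting minus the mean at its low setting. In the sketch below, the eight responses are hypothetical, back-figured to be consistent with the fitted effects in the output (57.5, 27.5, 10.0, and the 7.5 interaction); the paper does not list the raw acidity values:

```python
# Main effects in a 2^3 factorial: effect = mean(y at +1) - mean(y at -1).
# Responses are hypothetical, generated from a model consistent with
# the printed output:
#   y = 43.75 + 28.75*asp + 13.75*asc + 5*age + 3.75*asp*asc
from itertools import product

runs = list(product([-1, 1], repeat=3))   # coded (aspirin, ascorbic, age)
y = {r: 43.75 + 28.75*r[0] + 13.75*r[1] + 5*r[2] + 3.75*r[0]*r[1]
     for r in runs}

def effect(index):
    hi = sum(y[r] for r in runs if r[index] == 1) / 4
    lo = sum(y[r] for r in runs if r[index] == -1) / 4
    return hi - lo

print("aspirin:", effect(0))    # 57.5
print("ascorbic:", effect(1))   # 27.5
print("age:", effect(2))        # 10.0
```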
By going to STAT|DOE|ANALYZE FACTORIAL DESIGN|TERMS, we can eliminate all terms except aspirin and ascorbic acid. Since we have only 2 main terms, only 4 runs are required per replicate. But note that we still have 8 observations. The computer will recognize this, and turn our 4 extra observations into a replicate, giving us two replicates. The output now becomes quite a bit stronger.
Minitab does do a quick shuffle on us, without telling us: with a single replicate, it reports the magnitude of the effect. When we get two replicates, it reports a T value. The output looks like this:
Estimated Effects and Coefficients for acidity (coded units)
Term Effect Coef StDev Coef T P
Constant 43.75 2.795 15.65 0.000
aspirin 57.50 28.75 2.795 10.29 0.000
ascorbic 27.50 13.75 2.795 4.92 0.004
Analysis of Variance for acidity (coded units)
Source DF Seq SS Adj SS Adj MS F P
Main Effects 2 8125.0 8125.0 4062.50 65.00 0.000
Residual Error 5 312.5 312.5 62.50
Lack of Fit 1 112.5 112.5 112.50 2.25 0.208
Pure Error 4 200.0 200.0 50.00
Total 7 8437.5

[Figure: Pareto Chart of the Standardized Effects (response is acidity, Alpha = .10): aspirin, ascorbic.]
We now get F and P values, and note that we still have about the same estimates for the constant (average acidity with no treatment), aspirin, and ascorbic acid. We also see that our model accounts for 8125/8437.5 of the variation, with 312.5/8437.5 of the variation being attributed to noise (variables we did not account for).
We have less than one chance per thousand (.001) of being wrong if we assert that aspirin and ascorbic acid increase acidity. We also know the magnitude of change that each variable causes. That's a pretty substantial model, for only 8 observations.
Even if we cannot reduce the model by dropping one variable, a simple 8 observation experiment for three variables can be very revealing, especially if we are in the screening stages.

The output of the test assumes a linear model. If we want to test the validity of this assumption, we can add center points. Three per replicate is pretty normal.
Using Power and Sample Size, we find that if we want to be .95 sure of catching a 15 point shift in a process that normally runs with a standard deviation of 15, and run no more than a .05 alpha risk, we need 7 replicates, or 56 total observations. For 56 observations, we get full evaluation of three variables and all their interactions. A T Test to detect a single variable on this level requires 54 observations. Two more observations give us a lot more information. A four variable experiment with these objectives requires only 64 observations. Actually, the factorial experiment is much more likely to succeed than the T Test, besides being much more informative.
Why should this be so? Suppose we are testing blood cholesterol levels, and our variables are Medicine X, gender, and exercise level. If we test Medicine X vs. placebo on the general population, we will have relatively broad distributions, since we randomly select males, females, couch potatoes and athletes to ensure that we represent the whole population. In a 2K, our distributions will have relatively smaller standard deviations, since we will be considering more restricted groups, made up of males who exercise, male couch potatoes, female athletes, and female couch potatoes. These are more select groups. We have taken some of the variation out, and made it into variables which we test. Skinnier distributions make it easier to detect smaller changes, so our chances of success are better than they would be with a T Test.
To demonstrate the effect of replicates, the same data was copied into an experiment with the same variables, and 7 replicates instead of 1. Random numbers with a mean of 0 and a standard deviation of 10 were added to the data, to simulate the effects of unaccounted for variables and measurement system error. Our results are:
Fractional Factorial Fit
Estimated Effects and Coefficients for acid2 (coded units)
Term Effect Coef StDev Coef T P
Constant 42.2642 1.337 31.62 0.000
aspirin 59.4971 29.7485 1.337 22.26 0.000
ascorbic 28.5346 14.2673 1.337 10.67 0.000
age 14.8242 7.4121 1.337 5.55 0.000
aspirin*ascorbic 14.2761 7.1380 1.337 5.34 0.000
aspirin*age -0.0245 -0.0123 1.337 -0.01 0.993
ascorbic*age 0.5717 0.2859 1.337 0.21 0.832
aspirin*ascorbic*age 1.5459 0.7730 1.337 0.58 0.566
Analysis of Variance for acid2 (coded units)
Source DF Seq SS Adj SS Adj MS F P
Main Effects 3 64034.4 64034.4 21344.8 213.34 0.000
2-Way Interactions 3 2857.9 2857.9 952.6 9.52 0.000
3-Way Interactions 1 33.5 33.5 33.5 0.33 0.566
Residual Error 48 4802.5 4802.5 100.1
Pure Error 48 4802.5 4802.5 100.1
Total 55 71728.2
Unusual Observations for acid2
Obs acid2 Fit StDev Fit Residual St Resid
11 34.105 12.708 3.781 21.398 2.31R
44 103.573 84.959 3.781 18.614 2.01R
55 47.981 26.582 3.781 21.399 2.31R
R denotes an observation with a large standardized residual
[Figure: Pareto Chart of the Standardized Effects (response is acid2, Alpha = .10). A: aspirin, B: ascorbic, C: age.]

[Figure: Histogram of Residuals.]

[Figure: I Chart of Residuals. X=0.000, 3.0SL=30.76, -3.0SL=-30.76.]

[Figure: Residuals vs. Fits.]

[Figure: Normal Plot of Residuals.]

[Figure: Interaction Plot (data means) for acid2: ascorbic acid vs. aspirin.]
-
8/6/2019 Beginers Guide to Statistics
18/18
18
With 7 replicates, we can see more clearly, and all the variables have statistical significance. We can also see that there is a significant interaction between aspirin and ascorbic acid. This is different from the two producing a simple additive result. The combination of aspirin and ascorbic acid being in their high state produces a stronger result than just adding the two results together.
We have also performed an analysis of our residuals. This is something you should always do when you have replicates. We want to see residuals that are normally distributed, time stable, and fairly equally spaced about zero in the residuals vs. fits graph. These residuals aren't perfect, but they are quite adequate.

We have nicely recovered our factors, too, even in the face of a fair amount of deliberate statistical noise. The recovered factors correspond well with the ones used to generate the outcomes.
This class of experimental designs has some tremendous benefits:
1. It is vastly more efficient than single factor designs. For the price of a T test, which provides information on a single variable, we get five variables and all their possible interactions. For six or more variables, it becomes more efficient to use advanced versions of this design, which are beyond the scope of this paper.
2. Single factor designs will discover interactions only with the greatest of labor and successful intuition. 2K designs discover and characterize them with great ease.
3. 2K designs are more likely to discover effects than simple T tests are. Factors that cloud a T test with noise can be made into input variables in this type of test, and that removes their ability to conceal effects.
Promontory Management Group, Inc.
www.pmg.cc
801 710 5645
This document provided free of charge, for the use of the person who requested it. It may not be reproduced, distributed, or incorporated in a course of study without written permission.