TRANSCRIPT
8/6/2019 Beginners Guide to Statistics
A Beginner's Guide to Statistical Testing
Denton Bramwell, Nov 15, 1999
Promontory Management Group Inc., [email protected]
Statistics was invented as an occupation in order to provide employment for those who do
not have enough personality to go into accounting. I guess that explains why I took up
Six Sigma, with its concomitant involvement in statistics.
In this short paper, I'll explain in the simplest way I can the practical fundamentals of a few key statistical tests. The statistical software that I use is Minitab version 12. [Note: Version 13 has since been released.]
Data Types
There are four types of data. It is important to know which kind of data you are dealing
with, since this determines the types of test you can do. The four types are
Nominal or categorical: Classes things into categories, such as pass/fail, old/new, red/blue/green, or died/survived.

Ordinal: Ranks things. Team B is stronger than C, which is, in turn, stronger than A.

Interval: Things you can add and subtract, like degrees Celsius or Fahrenheit. Interval data can be continuous or discrete. Continuous data can take on any value, such as the actual air pressure inside a tire. Discrete data comes in steps, like money. In the US, the smallest increment of money is one cent, so expressions of money normally come in steps of one cent.

Ratio: Similar to interval data, but zero indicates a total absence of a property. Temperature expressed in kelvins is ratio data. Like interval data, ratio data may be continuous or discrete.
Nominal data yields the least information. Ordinal data is better. Interval or ratio data is best.
Suppose you have been told that you will be taking a long trip, to 10 different destinations, somewhere in the world. If you are told only that five are in the northern hemisphere, and that five are in the southern hemisphere, you have no idea what kind of clothes to pack. That is because you have been given categorical data, and categorical data conveys the least information of any of the data types. On the other hand, if you are given the exact coordinates of your 10 destinations, you will know exactly what to pack. Expressing positions on the Earth's surface requires ratio data, and ratio data conveys the most information. If possible, use interval or ratio data in conducting your investigations.
Basics of Hypothesis Testing
The null hypothesis, Ho, is the dull hypothesis: Nothing interesting is going on. For
example, it makes no difference whether people receive an antibiotic or a placebo.
The Chi Square Test
This is a test using nominal data. It makes very few assumptions, so it can be used where other tests are useless. The bad news is that it can require a lot of data, and the results are sometimes hard to interpret.
This test is performed on data organized into rows and columns. For example, we might want to test survival of patients who are given one of three types of care in a life-threatening situation. The data array might look like
Treatment 1 Treatment 2 Treatment 3
Survived 21 40 19
Died 9 2 11
The null hypothesis of the Chi Square test is that the rows and columns are statistically independent. That is, if I know which column a case is in, I cannot make much better than a random guess as to which row it is from, or vice versa.
Let's now go to Minitab and perform the test. Enter the data table, and choose STATS|TABLES|CHI SQUARE. You will obtain this report:
1 21 40 19 80
23.53 32.94 23.53
2 9 2 11 22
6.47 9.06 6.47
Total 30 42 30 102
Chi-Sq = 0.272 + 1.513 + 0.872 +
0.989 + 5.500 + 3.171 = 12.316
DF = 2, P-Value = 0.002
You can see that our original input is printed, with row and column sums. Beneath each of the original entries is an expected value for each cell. These are the numbers 23.53, 32.94, and so on. If any of these values is below 5, it means that you have not collected enough data to run this test, and cannot depend on the result. In this case, our lowest cell has an expected value of 6.47, so we are good to go.
The interpretation of this test lies in the P value, which is our risk, or probability of being wrong if we assert there is a difference. It says that we would run a .2% chance of being wrong if we asserted that the treatments produced different survival rates. That's a strong outcome. The weakness of the test is that all you really statistically know is that they are not the same. You don't officially know which treatment is better or worse.
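If you want to check Minitab's arithmetic, the statistic can be reproduced by hand. This is a sketch in Python (my choice of language, not part of the original paper); it relies on the fact that for 2 degrees of freedom the chi-square survival function reduces to exp(-x/2):

```python
# Chi Square test on the treatment/survival table, standard library only.
import math

observed = [[21, 40, 19],   # survived
            [9,  2,  11]]   # died

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n   # e.g. 23.53, 32.94, ...
        chi_sq += (o - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (2-1)*(3-1) = 2
p_value = math.exp(-chi_sq / 2)                     # exact only for df = 2

print(f"Chi-Sq = {chi_sq:.3f}, DF = {df}, P-Value = {p_value:.3f}")
```

This reproduces the report's Chi-Sq = 12.316, DF = 2, P-Value = 0.002.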
The Tukey Tail Count
The B vs. C, or Tukey Tail Test, is extremely easy to use, and reasonably powerful. It is nonparametric. That means that it does not depend on the data being normally distributed, so you can use this test when you have really ugly data, and still get beautiful results. Another good feature of the B vs. C test is that strong statistical indications can be found with very small amounts of data. This test uses ordinal data.
C represents one process that we want to evaluate, usually the current process that we want to replace. B represents the other process, usually the better process that we want to put in place. John Tukey is the statistician who developed and popularized the test that B vs. C is based on. The data analysis is based on counting Bs and Cs on the ends, or tails, of a distribution. That's where the name Tukey Tail Test comes from.
This test absolutely requires randomization and/or blocking. It is tempting to just run a few Cs, then switch over and run a few Bs, and make a decision. This is an invitation to error. You must either block or randomize such variables as age, gender, or other factors that might influence the outcome.
Suppose that you randomly select just 3 items from your C process, and 3 items from your B process, and that all 3 of your Bs were better than your Cs. Intuitively, you'd probably feel that you were on to something, and that your B process was indeed better. Statistics support your intuition.
There are just 20 different orders you can put 3 Bs and 3 Cs into (try it!). Only one of those is 3 Bs above 3 Cs. You have just one chance in 20, or .05 probability, of arriving at such an arrangement by chance. So if all 3 Bs are better than all 3 Cs, you can be 95% confident that B is indeed better than C, and you have arrived at this with only six samples.
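The counting argument is easy to verify by brute force. A short Python sketch (my addition; any language would do):

```python
# Enumerate the distinct orderings of 3 Bs and 3 Cs, and confirm that
# exactly one of the 20 puts all three Bs on top.
from itertools import permutations
from math import comb

orderings = set(permutations("BBBCCC"))
assert len(orderings) == comb(6, 3)          # 6!/(3!3!) = 20 arrangements

all_bs_on_top = [o for o in orderings if o[:3] == ("B", "B", "B")]
alpha = len(all_bs_on_top) / len(orderings)  # 1/20

print(len(orderings), alpha)   # 20 0.05
```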
One other nicety is that this works for cosmetic issues. You don't really have to be able to measure goodness. You just have to be able to rank your Bs and Cs from highest to lowest. So if you want to test the attractiveness of men/women in the fashion modeling business vs. the attractiveness of men/women in med school, this is your test, assuming you can put all of them in rank order of attractiveness.
The following table shows the number of Bs and Cs you need, for a given level of alpha risk. Note that for each level of alpha, there are several acceptable combinations of B and C sample sizes. Remember, you must randomize your selections. Also remember, all your Bs must outrank all your Cs. Note that the table includes our 3 B vs. 3 C example.
Risk    Number of B samples    Number of C samples
.001    2                      43
        3                      16
        4                      10
        5                      8
        6                      6
.01     2                      13
        3                      7
        4                      5
        5                      4
.05     1                      19
        2                      5
        3                      3
        4                      3
.1      1                      9
        2                      3
        3                      2
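Each table entry can be checked against the underlying probability: when all b Bs outrank all c Cs, the chance of that arrangement is 1/C(b+c, b). A Python sketch of the check (the table values are from the paper; the code is mine):

```python
# Verify the B vs. C table: each (B, C) pair should give a chance
# probability at or near its stated risk level.
from math import comb

table = {
    0.001: [(2, 43), (3, 16), (4, 10), (5, 8), (6, 6)],
    0.01:  [(2, 13), (3, 7), (4, 5), (5, 4)],
    0.05:  [(1, 19), (2, 5), (3, 3), (4, 3)],
    0.1:   [(1, 9), (2, 3), (3, 2)],
}

for risk, pairs in table.items():
    for b, c in pairs:
        p = 1 / comb(b + c, b)
        print(f"{b} Bs vs {c} Cs: p = {p:.5f} (stated risk {risk})")
        assert p <= risk * 1.1   # small tolerance for rounded table entries
```

Our 3 B vs. 3 C example appears in the .05 row: 1/C(6, 3) = 1/20 = .05.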
There is another variation on this test that is useful, and in many cases preferred to the system already shown. In this system, all the Bs do not need to outrank all the Cs. It does require that the number of B samples and the number of C samples be approximately equal. If the ratio between the size of the B sample and the C sample falls in the range of 3:4 to 4:3, the sample sizes are near enough to being equal. This test also absolutely requires randomization.
Suppose that you draw a random sample of 6 Bs and 6 Cs, and that you rank them in order of goodness. Your distribution might look like this:

[Figure: the 12 samples listed from best to worst. The unbroken run of pure Bs at the top is your B end count; the unbroken run of pure Cs at the bottom is your C end count; the mixed Bs and Cs in the middle are disregarded.]
Now just add your B end count to your C end count, and refer to this chart for your level of significance.
Risk    B+C End Count, At Least
.1      6
.05     7
.01     10
.001    13
Since our total end count is 7, we can be 95% confident that B is better than C.
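A small helper makes the end-count rule concrete. The ranking below is hypothetical, constructed to have the same total end count of 7 as the example above:

```python
# Tukey end count for a ranked list (best first): the run of pure Bs
# at the top plus the run of pure Cs at the bottom.
def end_count(ranked):
    top = 0
    for x in ranked:
        if x != "B":
            break
        top += 1
    bottom = 0
    for x in reversed(ranked):
        if x != "C":
            break
        bottom += 1
    return top + bottom

# Hypothetical ranking of 6 Bs and 6 Cs, best to worst:
sample = list("BBBBCCBCBCCC")
print(end_count(sample))   # 4 Bs on top + 3 Cs on the bottom = 7
```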
The Two Sample T Test
This is a powerful test, using interval or ratio data. Its null hypothesis is that the means of two normally distributed populations are the same. The alternate hypothesis can be that they are not equal, or that the mean of A is greater/less than the mean of B.
This test requires that two fundamental assumptions be met:
The data are normally distributed. A normal distribution is also called a Gaussian distribution. Actually, the test is pretty robust, and will still give good results with pretty non-normal data.
The data are stable during the sampling period. This assumption is very important.
You must demonstrate that you meet these assumptions if you want to cruise down the
Gaussian superhighway. Fortunately, this is easy.
[Figure: I Chart for DATA1, individual values plotted by observation number. X=100.1, 3.0SL=117.6, -3.0SL=82.61.]
First, let's check stability for DATA1. There are several ways to do this, but my favorite is the Individuals Control Chart. This is run by clicking STATS|CONTROL
CHARTS|INDIVIDUALS. The basic thing we're doing is verifying that there are no trends or sudden, major shifts in the data (stability). This chart spreads the data out in time, and allows us to inspect it. DATA1 is just fine.
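Minitab draws the chart for you, but the control limits are simple to compute yourself: the usual individuals-chart limits sit at the mean plus or minus 2.66 times the average moving range (2.66 = 3/d2, with d2 = 1.128 for a moving range of span 2). A Python sketch with hypothetical data, since the DATA1 values are not reproduced in this paper:

```python
# Individuals (I) chart limits: center line at the mean, control
# limits at mean +/- 2.66 * average moving range. Hypothetical data.
data = [100, 103, 98, 101, 97, 104, 99, 102, 100, 96]

mean = sum(data) / len(data)
moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
mr_bar = sum(moving_ranges) / len(moving_ranges)

ucl = mean + 2.66 * mr_bar
lcl = mean - 2.66 * mr_bar
print(f"X = {mean:.2f}, 3.0SL = {ucl:.2f}, -3.0SL = {lcl:.2f}")
```

Points outside these limits, or obvious trends and shifts within them, are evidence against stability.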
DATA2 is not so fine. I see an upward trend in the first dozen data points, followed by a sudden downward shift, then a sudden upward shift at about data point 19. This data does not meet our assumption of stability.
[Figure: I Chart for DATA2. X=106.6, 3.0SL=129.0, -3.0SL=84.29.]

[Figure: I Chart for DATA3. X=0.6667, 3.0SL=25.36, -3.0SL=-24.03.]
DATA3 has a discrimination problem. The only values that the data can take on are -10, -5, 0, 5, and 10. Since the data can only exist in one of five states, it will not pass a normality test. I've researched the matter, and have not been able to get an accurate statement of how much discrimination is enough. My opinion is that you should see at least 10 different values in the data, and there is some reasoning behind this opinion, which we don't have space to discuss here.
We have now concluded that DATA2 and DATA3 are not prime candidates for the Gaussian/normal model. This does not mean that we absolutely cannot apply the model, but it does mean that if we do apply it, we use the results with appropriate caution.
DATA1 has met both of our tests so far, and requires only one more test. That is for normality. This test is done by clicking STAT|BASIC STATS|NORMALITY TEST. DATA1 produces this result:
Anderson-Darling Normality Test
A-Squared: 0.152
P-Value: 0.953
N: 24
Average: 100.082
StDev: 5.26480

[Figure: Normal Probability Plot of DATA1.]
The first test is to simply look at the data points and see if they fall close to the red line. How close is close enough? We generally use the rule that if it can be covered by a fat pencil, it is good enough. In fact, if you have fewer than 15-20 data points, this is the only test you need apply. P values are not too revealing for small samples. However, if you have many points, pay close attention to the P value. If it is .05 or more, you have no reason to believe the data is non-normal.
Another way to look for normality is to do STAT|BASIC STATS|DISPLAY DESCRIPTIVE STATISTICS, and under GRAPHS, check GRAPHICAL SUMMARY. That will produce this result:
[Figure: Descriptive Statistics graphical summary for DATA1. Anderson-Darling Normality Test: A-Squared 0.152, P-Value 0.953. Mean 100.082, StDev 5.265, Variance 27.7181, Skewness 0.341344, Kurtosis 9.98E-04, N 24. Minimum 90.854, 1st Quartile 96.666, Median 100.344, 3rd Quartile 103.928, Maximum 112.583. 95% Confidence Interval for Mu: 97.859 to 102.306. 95% Confidence Interval for Sigma: 4.092 to 7.385. 95% Confidence Interval for Median: 97.551 to 101.980.]
Note that we get our same normality P value, .953, and that the computer draws its best estimate of a normal curve that fits the data histogram. Don't be shocked if your data
looks a lot more ragged than this, but still tests normal. Small samples can look pretty Raggedy Andy, and still truly be normal.
Using the normal distribution superhighway has its distinct advantages. The toll is that you should give at least a little attention to checking the assumptions. In practice, you need not be overly concerned about normality if your samples are reasonably large. It is not necessary for the distribution of data to be normal. It is only necessary that the distribution of differences be normal, and, with decent size samples, that will be the case.
Let us assume that we have been collecting data on a weight loss product. Two randomly selected groups of volunteers are to be tested. Group A receives a placebo, and group B receives Factor L. Both groups are weighed at the beginning of the study, and 90 days later, and the change in weight is calculated. For each volunteer we have a number that represents weight gain. A gain of -8 would be a loss of 8 pounds. We have tested the data for normality and stability, and it is satisfactory. We select as our null hypothesis: Weight gain does not depend on whether the volunteer received the placebo or Factor L. Our alternate hypothesis is that the mean gain for the control group is greater than the mean gain for the test group.
The test is performed by invoking STATS|BASIC STATS|2-SAMPLE T.
At this point, we must confess that we have slipped a new idea into the mix. We have developed a lot of our ideas based on the Gaussian distribution. It turns out there is another distribution that is practically identical to the Gaussian distribution for samples larger than 30. However, this other distribution is more accurate than the Gaussian for smaller sample sizes. This is the Student's T distribution. We almost always use it, since it applies everywhere the Gaussian distribution does, and some places that it does not.
Assume that my control group is in the column CONTROL, and that the test group is in the column FACTORL. You would then choose STAT|BASIC STATS|2-SAMPLE T. From the dialog window choose SAMPLES IN DIFFERENT COLUMNS, and indicate CONTROL as the first column and FACTORL as the second column. As your ALTERNATIVE, indicate GREATER THAN. This may seem a little backwards, but we are hoping the weight gains in the first column are greater than the weight gains in the second column. Minitab is not specific about telling you how to do this, so you must know that it assumes the form COLUMN 1 [>, <, or not equal] COLUMN 2.
Two sample T for CONTROL vs FACTORL
          N   Mean  StDev  SE Mean
CONTROL  24   1.01   3.19     0.65
FACTORL  24  -2.61   4.36     0.89
95% CI for mu CONTROL - mu FACTORL: ( 1.40, 5.85)
T-Test mu CONTROL = mu FACTORL (vs >): T = 3.29 P = 0.0010 DF = 42
[Figure: Boxplots of CONTROL and FACTORL (means are indicated by solid circles).]
The interpretation of this is that our control group of 24 volunteers gained an average of 1.01 pounds, and that the standard deviation of the group's gain was 3.19 pounds. Our test volunteers gained an average of -2.61 pounds (a 2.61 pound loss), and the standard deviation of the group's gain was 4.36 pounds. If we assert that the gain of the control group is greater than the gain of the test group, there is a .0010 chance that we will be wrong. We therefore reject the null hypothesis, and accept the alternative hypothesis. The box plot gives a nice visual to demonstrate what the statistics tell us.
It is by no means necessary to have equal sample sizes.
We did not check the ASSUME EQUAL VARIANCES box, because we have not taken the step of demonstrating that the standard deviations (and hence the variances) of the two groups are approximately equal. If you want to run TEST OF EQUAL VARIANCES on your data, you can earn the right to check this box. However, doing this essentially relieves the computer of some of its calculation duties, at some expense of your time, with negligible difference in results. If you are feeling compassionate toward your overworked computer, you may want to take this route. I suggest just leaving the box unchecked.
The default alternative hypothesis is NOT EQUAL. We chose to do a one sided test, GREATER THAN, because it always gives the test more power. If we had chosen NOT EQUAL, our P value would have been .0020. In this case, we would still make the same decision, but, still, we have doubled our P value. Frequently, this will be of great importance. If the test results are interesting only if they come out in one particular direction, then use the one sided test. That is the case most of the time.
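The T statistic in the report can be reproduced from the summary statistics alone. This sketch (mine, not the paper's) uses Welch's formula, matching the unchecked equal-variances box; the small difference from Minitab's T = 3.29 comes from the rounding of the printed means and standard deviations:

```python
# Welch's two-sample t from summary statistics (no equal-variance
# assumption). Values are the means/StDevs printed in the report.
import math

n1, m1, s1 = 24, 1.01, 3.19     # CONTROL
n2, m2, s2 = 24, -2.61, 4.36    # FACTORL

se = math.sqrt(s1**2 / n1 + s2**2 / n2)
t = (m1 - m2) / se              # about 3.28, vs. the report's 3.29

# Welch-Satterthwaite degrees of freedom
num = (s1**2 / n1 + s2**2 / n2) ** 2
den = (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
df = num / den                  # about 42, matching DF = 42

print(f"T = {t:.2f}, DF = {int(df)}")
```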
ANOVA
ANOVA stands for Analysis of Variance. It is very much like the 2 Sample T Test, but instead of having just two groups, you can have multiple groups. This is extremely handy. The null hypothesis of ANOVA is that all the groups have the same mean. Also, it assumes that all the groups have roughly the same standard deviation. Before running ANOVA, you should do Test of Equal Variances to ensure this. If Levene's test comes out with a P of .05 or more, you're generally good to go.
Giving ANOVA such a short discussion shouldn't be interpreted as indicating that it is less useful. It is an extremely useful tool, but beyond the scope of a short paper. If you understand the T Test, you should be able to move into simple ANOVA without much difficulty.
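For the flavor of what ANOVA computes, here is the F statistic by hand on three small hypothetical groups (stdlib Python; Minitab of course does all of this for you):

```python
# One-way ANOVA by hand: F is the between-group mean square over the
# within-group mean square. Hypothetical data.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

all_data = [x for g in groups for x in g]
grand_mean = sum(all_data) / len(all_data)

ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

df_between = len(groups) - 1               # 2
df_within = len(all_data) - len(groups)    # 6
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F = {f_stat:.2f}")
```

A large F (equivalently, a small P) says the group means differ by more than the within-group noise can explain.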
Power and Sample Size
DO NOT BEGIN AN EXPERIMENT THAT YOU ARE GOING TO ANALYZE WITH A T TEST OR A 2K FACTORIAL WITHOUT UNDERSTANDING THIS SECTION!
Risk, sample size, and the size of change you are trying to detect are three factors that are eternally at odds. It is rather like the old engineering maxim, "Good, fast, cheap. Pick any two." You can run a test for a small change, using a small sample size, but your risk of error will be high. You can run a test with low risk, trying to detect a small change, but your sample size will be large.
A powerful test is one that is not highly influenced by random factors. The mathematical definition of power is 1 - beta. It is the probability that you will detect a difference, if, in fact, it exists.
If you properly use this function of Minitab, you will be able to predict the approximate chances of success of your experiment. You would be amazed how many tests are run that have too few samples to have any reasonable chance of success. Conversely, you would also be amazed at how many tests are much more expensive than necessary, because they have far too many samples.
If someone wants you to run a test that you can demonstrate has a power of .5 or less, I suggest that you give them a quarter, and tell them to make their decision based on whether it comes up heads or tails. If your test is no better than a quarter, don't bother running it. Get on to something that does have a good chance of success.
Let's suppose that you know that normal for Factor K in the general population includes values from 90 to 140. Ben assures me that the medical standard for normal is 95% of the population. That's convenient, because 95% of the population occurs from 2 standard deviations below the mean to 2 standard deviations above the mean. We know, then, that 90 to 140 spans four standard deviations, and can quickly deduce that one standard deviation is about 12.5.
Now suppose that we have a new treatment that we think will beneficially increase Factor K by 15 points. Go to STAT|POWER AND SAMPLE SIZE. Choose 2-SAMPLE T, then CALCULATE SAMPLE SIZE FOR EACH POWER VALUE. In POWER VALUES, enter .95 .9 .87. In DIFFERENCE enter 15. In SIGMA, enter 12.5. Click OPTIONS, and choose GREATER THAN as your alternative hypothesis. Minitab will return this:
2-Sample t Test
Testing mean 1 = mean 2 (versus >)
Calculating power for mean 1 = mean 2 + 15
Alpha = 0.05 Sigma = 12.5
Sample Target Actual
Size Power Power
16 0.9500 0.9527
13 0.9000 0.9077
12 0.8700 0.8854
In this case, if we want a power of at least .95, we need two samples of 16 each. For a power of .9, we need only 13 in each sample, and if we are willing to settle for only .87, we need 12 in each sample.
Similarly, we can calculate our chances of success if sample size is dictated by budget or the ever innumerate pointy-haired boss.
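Minitab's figures can be sanity-checked with the normal approximation: n per group is roughly 2*((z_alpha + z_beta)*sigma/delta)^2. The exact t-based answer Minitab reports (16 at power .95) is slightly larger than this approximation. A stdlib Python sketch:

```python
# Normal-approximation sample size per group for a one-sided
# 2-sample t test. Minitab's exact t-based value is a bit larger.
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.95):
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # one-sided alpha
    z_beta = z.inv_cdf(power)
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

n = n_per_group(delta=15, sigma=12.5)
print(f"n per group (normal approximation): {n:.1f}")   # about 15
```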
2K Factorial Experiments
These are wonderful.
When you do a T Test, or a single-factor ANOVA, you are changing one variable, and studying the effect on your test subjects. This is a one factor at a time or OFAT experiment.
OFAT experiments are fine, but you have to realize that any factor that you don't account for shows up as noise, which makes your distributions fatter, and makes your effect harder to see. That is, cholesterol levels depend on a number of factors, such as diet, level of exercise, and heredity. In an OFAT, you randomize to make sure that these variables do indeed show up as noise. Would it not be better to make them part of the experiment? Then they can be made into signal instead of noise.
A 2K experiment allows you to do this. It also allows you to detect interactions between the variables. The lovely thing is that you can do this for four or five variables, and still not have to take many more observations.
For example, you can test two treatments simultaneously, and account for any differences due to gender, and you can test for all the interactions, all for the price of a T Test.
To set up your test, choose two to five variables you want to test. More than this can get a little messy. Choose a high and a low value for each variable. It is better if these are interval values, but they can be categorical/nominal. For example, if I want to test the effect of ascorbic acid and aspirin on acidity, I might choose zero as my low dose for both, and 500 mg as my high dose for both. I might want to test subject age as my other variable. I could pick 20-25 as the low value for my age variable and 55-60 as the high value.
My output will be some very scientifically arrived at acidity scale, which extends from 0 to 100.
For a simple demonstration, let's do this with a single replicate, and no center points. We'll add these embellishments in class.
In Minitab, go to STAT|DOE|CREATE FACTORIAL DESIGN. Choose 2 LEVEL FACTORIAL, with 3 as your number of variables. Under DESIGNS, choose FULL FACTORIAL, with 0 center points, 1 replicate, and 1 block. Click OK, then go to FACTORS, and in the NAME column, put ascorbic acid next to A, aspirin next to B, and age next to C. Your low value will automatically be coded as a -1, and your high value will automatically be coded as a 1. Yes, you can change this, but wait until you understand a little better.
Click OK, and OK again, and Minitab will design your experiment for you. It will automatically randomize your run order for you. (Actually, with a single replicate, randomization buys you nothing, so we would have been justified in not randomizing this time. However, we will shortly drop a variable, effectively creating a second replicate, and, at that point, randomization starts to have some use.) Your design will look different from mine, because your random number generator spat out a different sequence than mine did (well, we hope so, or they are not so random). Anyway, mine looks like this:
StdOrder  RunOrder  CenterPt  Blocks  ascorbic acid  aspirin  age  acidity
2         1         1         1        1             -1       -1
3         2         1         1       -1              1       -1
8         3         1         1        1              1        1
1         4         1         1       -1             -1       -1
4         5         1         1        1              1       -1
6         6         1         1        1             -1        1
7         7         1         1       -1              1        1
5         8         1         1       -1             -1        1
This instructs me to find a young subject, give him a high dose of ascorbic acid, a low
dose of aspirin, and note the resulting acidity. I then continue on down the list. One
missing data point is not necessarily the kiss of death when you have multiple replicates (sets of data), but you do want to try very hard to get complete sets.
In my table, I created a made-up set of results, and Minitab obligingly spits out the
following interesting output:
Fractional Factorial Fit
Estimated Effects and Coefficients for acidity (coded units)
Term Effect Coef
Constant 43.7500
aspirin 57.5000 28.7500
ascorbic 27.5000 13.7500
age 10.0000 5.0000
aspirin*ascorbic 7.5000 3.7500
aspirin*age 0.0000 0.0000
ascorbic*age -0.0000 -0.0000
aspirin*ascorbic*age 0.0000 0.0000

Analysis of Variance for acidity (coded units)

Source               DF   Seq SS   Adj SS   Adj MS   F   P
Main Effects          3  8325.00  8325.00  2775.00   *   *
2-Way Interactions    3   112.50   112.50    37.50   *   *
3-Way Interactions 1 0.00 0.00 0.00 * *
Residual Error 0 0.00 0.00 0.00
Total 7 8437.50
[Figure: Pareto Chart of the Effects (response is acidity, Alpha = .10). A: aspirin, B: ascorbic, C: age.]

The interpretation of this output is that going from the low state of aspirin to the high state accounts for 57.5 units of the observed change in acidity. Similarly, ascorbic acid is
responsible for 27.5 units of change. Age, and an interaction between aspirin and ascorbic acid, account for lesser amounts of change, and the model predicts that these last two are not statistically significant. We shall see more about this as we add replicates.

Since age is not indicated as a significant variable, we can gain more information about the important variables by eliminating this unimportant one.
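The effect estimates themselves are nothing exotic: each effect is simply the mean response at the factor's high setting minus the mean at its low setting. In the sketch below, the eight responses are hypothetical, back-figured to be consistent with the fitted effects in the output (57.5, 27.5, 10.0, and the 7.5 interaction); the paper does not list the raw acidity values:

```python
# Main effects in a 2^3 factorial: effect = mean(y at +1) - mean(y at -1).
# Responses are hypothetical, generated from a model consistent with
# the printed output:
#   y = 43.75 + 28.75*asp + 13.75*asc + 5*age + 3.75*asp*asc
from itertools import product

runs = list(product([-1, 1], repeat=3))   # coded (aspirin, ascorbic, age)
y = {r: 43.75 + 28.75*r[0] + 13.75*r[1] + 5*r[2] + 3.75*r[0]*r[1]
     for r in runs}

def effect(index):
    hi = sum(y[r] for r in runs if r[index] == 1) / 4
    lo = sum(y[r] for r in runs if r[index] == -1) / 4
    return hi - lo

print("aspirin:", effect(0))    # 57.5
print("ascorbic:", effect(1))   # 27.5
print("age:", effect(2))        # 10.0
```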
By going to STAT|DOE|ANALYZE FACTORIAL DESIGN|TERMS, we can eliminate all terms except aspirin and ascorbic acid. Since we have only 2 main terms, only 4 runs are required per replicate. But note that we still have 8 observations. The computer will recognize this, and turn our 4 extra observations into a replicate, giving us two replicates. The output now becomes quite a bit stronger.
Minitab does do a quick shuffle on us, without telling us: with a single replicate, it reports the magnitude of the effect. When we get two replicates, it reports a T value. The output looks like this:
Estimated Effects and Coefficients for acidity (coded units)
Term Effect Coef StDev Coef T P
Constant 43.75 2.795 15.65 0.000
aspirin 57.50 28.75 2.795 10.29 0.000
ascorbic 27.50 13.75 2.795 4.92 0.004
Analysis of Variance for acidity (coded units)
Source DF Seq SS Adj SS Adj MS F P
Main Effects 2 8125.0 8125.0 4062.50 65.00 0.000
Residual Error 5 312.5 312.5 62.50
Lack of Fit 1 112.5 112.5 112.50 2.25 0.208
Pure Error 4 200.0 200.0 50.00
Total 7 8437.5

[Figure: Pareto Chart of the Standardized Effects (response is acidity, Alpha = .10): aspirin, ascorbic.]
We now get F and P values, and note that we still have about the same estimates for the constant (average acidity with no treatment), aspirin, and ascorbic acid. We also see that our model accounts for 8125/8437.5 of the variation, with 312.5/8437.5 of the variation being attributed to noise (variables we did not account for).
We have less than one chance per thousand (.001) of being wrong if we assert that aspirin and ascorbic acid increase acidity. We also know the magnitude of change that each variable causes. That's a pretty substantial model, for only 8 observations.
Even if we cannot reduce the model by dropping one variable, a simple 8 observation experiment for three variables can be very revealing, especially if we are in the screening stages.

The output of the test assumes a linear model. If we want to test the validity of this assumption, we can add center points. Three per replicate is pretty normal.
Using Power and Sample Size, we find that if we want to be .95 sure of catching a 15 point shift in a process that normally runs with a standard deviation of 15, and run no more than a .05 alpha risk, we need 7 replicates, or 56 total observations. For 56 observations, we get full evaluation of three variables and all their interactions. A T Test to detect a single variable on this level requires 54 observations. Two more observations give us a lot more information. A four variable experiment with these objectives requires only 64 observations. Actually, the factorial experiment is much more likely to succeed than the T Test, besides being much more informative.
Why should this be so? Suppose we are testing blood cholesterol levels, and our variables are Medicine X, gender, and exercise level. If we test Medicine X vs. placebo on the general population, we will have relatively broad distributions, since we randomly select males, females, couch potatoes and athletes to ensure that we represent the whole population. In a 2K, our distributions will have relatively smaller standard deviations, since we will be considering more restricted groups, made up of males who exercise, male couch potatoes, female athletes, and female couch potatoes. These are more select groups. We have taken some of the variation out, and made it into variables which we test. Skinnier distributions make it easier to detect smaller changes, so our chances of success are better than they would be with a T Test.
To demonstrate the effect of replicates, the same data was copied into an experiment with the same variables, and 7 replicates instead of 1. Random numbers with a mean of 0 and a standard deviation of 10 were added to the data, to simulate the effects of unaccounted for variables and measurement system error. Our results are:
Fractional Factorial Fit
Estimated Effects and Coefficients for acid2 (coded units)
Term Effect Coef StDev Coef T P
Constant 42.2642 1.337 31.62 0.000
aspirin 59.4971 29.7485 1.337 22.26 0.000
ascorbic 28.5346 14.2673 1.337 10.67 0.000
age 14.8242 7.4121 1.337 5.55 0.000
aspirin*ascorbic 14.2761 7.1380 1.337 5.34 0.000
aspirin*age -0.0245 -0.0123 1.337 -0.01 0.993
ascorbic*age 0.5717 0.2859 1.337 0.21 0.832
aspirin*ascorbic*age 1.5459 0.7730 1.337 0.58 0.566
Analysis of Variance for acid2 (coded units)
Source DF Seq SS Adj SS Adj MS F P
Main Effects 3 64034.4 64034.4 21344.8 213.34 0.000
2-Way Interactions 3 2857.9 2857.9 952.6 9.52 0.000
3-Way Interactions 1 33.5 33.5 33.5 0.33 0.566
Residual Error 48 4802.5 4802.5 100.1
Pure Error 48 4802.5 4802.5 100.1
Total 55 71728.2
Unusual Observations for acid2
Obs acid2 Fit StDev Fit Residual St Resid
11 34.105 12.708 3.781 21.398 2.31R
44 103.573 84.959 3.781 18.614 2.01R
55 47.981 26.582 3.781 21.399 2.31R
R denotes an observation with a large standardized residual
[Figure: Pareto Chart of the Standardized Effects (response is acid2, Alpha = .10). A: aspirin, B: ascorbic, C: age.]

[Figure: Histogram of Residuals.]

[Figure: I Chart of Residuals. X=0.000, 3.0SL=30.76, -3.0SL=-30.76.]

[Figure: Residuals vs. Fits.]

[Figure: Normal Plot of Residuals.]

[Figure: Interaction Plot (data means) for acid2: ascorbic acid vs. aspirin.]
-
8/6/2019 Beginers Guide to Statistics
18/18
18
With 7 replicates, we can see more clearly, and all the variables have statistical significance. We can also see that there is a significant interaction between aspirin and ascorbic acid. This is different from the two producing a simple additive result. The combination of aspirin and ascorbic acid being in their high state produces a stronger result than just adding the two results together.
We have also performed an analysis of our residuals. This is something you should always do when you have replicates. We want to see residuals that are normally distributed, time stable, and fairly equally spaced about zero in the residuals vs. fits graph. These residuals aren't perfect, but they are quite adequate.

We have nicely recovered our factors, too, even in the face of a fair amount of deliberate statistical noise. The recovered factors correspond well with the ones used to generate the outcomes.
This class of experimental designs has some tremendous benefits:
1. It is vastly more efficient than single factor designs. For the price of a T test, which provides information on a single variable, we get five variables and all their possible interactions. For six or more variables, it becomes more efficient to use advanced versions of this design, which are beyond the scope of this paper.
2. Single factor designs will discover interactions only with the greatest of labor and successful intuition. 2K designs discover and characterize them with great ease.
3. 2K designs are more likely to discover effects than simple T tests are. Factors that cloud a T test with noise can be made into input variables in this type of test, and that removes their ability to conceal effects.
Promontory Management Group, Inc.
www.pmg.cc
801 710 5645
This document provided free of charge, for the use of the person who requested it. It may not be reproduced, distributed, or incorporated in a course of study without written permission.