
A Beginner's Guide to Statistical Testing
Denton Bramwell, Nov 15, 1999

    Promontory Management Group Inc., [email protected]

    Statistics was invented as an occupation in order to provide employment for those who do

    not have enough personality to go into accounting. I guess that explains why I took up

    Six Sigma, with its concomitant involvement in statistics.

In this short paper, I'll explain in the simplest way I can the practical fundamentals of a few key statistical tests. The statistical software that I use is Minitab version 12. [Note: Version 13 has since been released.]

    Data Types

    There are four types of data. It is important to know which kind of data you are dealing

    with, since this determines the types of test you can do. The four types are

Nominal or categorical: Classifies things into categories, such as pass/fail, old/new, red/blue/green, or died/survived.

Ordinal: Ranks things. Team B is stronger than C, which is in turn stronger than A.

Interval: Things you can add and subtract, like degrees Celsius or Fahrenheit. Interval data can be continuous or discrete. Continuous data can take on any value, such as the actual air pressure inside a tire. Discrete data comes in steps, like money. In the US, the smallest increment of money is one cent, so expressions of money normally come in steps of one cent.

Ratio: Similar to interval data, but zero indicates a total absence of a property. Temperature expressed in Kelvins is ratio data. Like interval data, ratio data may be continuous or discrete.

Nominal data yields the least information. Ordinal data is better. Interval or ratio data is best.

Suppose you have been told that you will be taking a long trip, to 10 different destinations, somewhere in the world. If you are told only that five are in the northern hemisphere, and that five are in the southern hemisphere, you have no idea what kind of clothes to pack. That is because you have been given categorical data, and categorical data conveys the least information of any of the data types. On the other hand, if you are given the exact coordinates of your 10 destinations, you will know exactly what to pack. Expressing positions on the Earth's surface requires ratio data, and ratio data conveys the most information. If possible, use interval or ratio data in conducting your investigations.

    Basics of Hypothesis Testing

    The null hypothesis, Ho, is the dull hypothesis: Nothing interesting is going on. For

    example, it makes no difference whether people receive an antibiotic or a placebo.


    The Chi Square Test

This is a test using nominal data. It makes very few assumptions, so it can be used where other tests are useless. The bad news is that it can require a lot of data, and the results are sometimes hard to interpret.

This test is performed on data organized into rows and columns. For example, we might want to test survival of patients who are given one of three types of care in a life-threatening situation. The data array might look like

              Treatment 1   Treatment 2   Treatment 3
    Survived       21            40            19
    Died            9             2            11

    The null hypothesis of the Chi Square test is that the rows and columns are statistically

independent. That is, if I know which column a case is in, I cannot make much better than a random guess as to which row it is from, or vice versa.

Let's now go to Minitab and perform the test. Enter the data table, and choose STATS|TABLES|CHI SQUARE. You will obtain this report:

        1       21      40      19      80
             23.53   32.94   23.53

        2        9       2      11      22
              6.47    9.06    6.47

    Total       30      42      30     102

    Chi-Sq =  0.272 + 1.513 + 0.872 +
              0.989 + 5.500 + 3.171 = 12.316
    DF = 2, P-Value = 0.002

You can see that our original input is printed, with row and column sums. Beneath each of the original entries is an expected value for each cell. These are the numbers 23.53, 32.94, and so on. If any of these values is below 5, it means that you have not collected enough data to run this test, and cannot depend on the result. In this case, our lowest cell has an expected value of 6.47, so we are good to go.

The interpretation of this test lies in the P value, which is our risk, or probability of being wrong if we assert there is a difference. It says that we would run a 0.2% chance of being wrong if we asserted that the treatments produced different survival rates. That's a strong outcome. The weakness of the test is that all you really statistically know is that they are not the same. You don't officially know which treatment is better or worse.
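For readers working outside Minitab, the same test can be reproduced with, for example, Python's scipy. This is only a sketch, assuming scipy is installed; the array layout mirrors the table above.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Observed counts: rows = survived/died, columns = treatments 1-3
    observed = np.array([[21, 40, 19],
                         [ 9,  2, 11]])

    chi2, p, dof, expected = chi2_contingency(observed, correction=False)

    print(expected.round(2))   # expected counts: 23.53, 32.94, ... as above
    print(f"Chi-Sq = {chi2:.3f}, DF = {dof}, P-Value = {p:.3f}")
    # A small P (here about 0.002) says the survival rates differ by treatment,
    # but not which treatment is better.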


    The Tukey Tail Count

The B vs. C, or Tukey Tail Test, is extremely easy to use, and reasonably powerful. It is nonparametric. That means that it does not depend on the data being normally distributed, so you can use this test when you have really ugly data, and still get beautiful results. Another good feature of the B vs. C test is that strong statistical indications can be found with very small amounts of data. This test uses ordinal data.

C represents one process that we want to evaluate, usually the current process that we want to replace. B represents the other process, usually the better process that we want to put in place. John Tukey is the statistician who developed and popularized the test that B vs. C is based on. The data analysis is based on counting Bs and Cs on the ends, or tails, of a distribution. That's where the name Tukey Tail Test comes from.

This test absolutely requires randomization and/or blocking. It is tempting to just run a few Cs, then switch over and run a few Bs, and make a decision. This is an invitation to error. You must either block or randomize such variables as age, gender, or other factors that might influence the outcome.

Suppose that you randomly select just 3 items from your C process, and 3 items from your B process, and that all 3 of your Bs were better than your Cs. Intuitively, you'd probably feel that you were on to something, and that your B process was indeed better. Statistics support your intuition.

There are just 20 different orders you can put 3 Bs and 3 Cs into (try it!). Only one of those is 3 Bs above 3 Cs. You have just one chance in 20, or .05 probability, of arriving at such an arrangement by chance. So if all 3 Bs are better than all 3 Cs, you can be 95% confident that B is indeed better than C, and you have arrived at this with only six samples.
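The 1-in-20 figure is just a counting argument; a short sketch makes it concrete (plain Python, standard library only).

    from itertools import permutations
    from math import comb

    # Number of distinct orderings of 3 Bs and 3 Cs
    n_orders = comb(6, 3)                      # "6 choose 3" = 20
    arrangements = set(permutations("BBBCCC"))
    print(len(arrangements), n_orders)         # both print 20

    # Only one of those orderings has all 3 Bs above all 3 Cs, so the chance
    # of seeing it when B and C are really the same is:
    print(1 / n_orders)                        # 0.05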

One other nicety is that this works for cosmetic issues. You don't really have to be able to measure goodness. You just have to be able to rank your Bs and Cs from highest to lowest. So if you want to test the attractiveness of men/women in the fashion modeling business vs. the attractiveness of men/women in med school, this is your test, assuming you can put all of them in rank order of attractiveness.

The following table shows the number of Bs and Cs you need for a given level of risk. Note that for each level of risk, there are several acceptable combinations of B and C sample sizes. Remember, you must randomize your selections. Also remember, all your Bs must outrank all your Cs. Note that the table includes our 3 B vs. 3 C example.


    Risk    Number of B samples    Number of C samples
    .001             2                      43
                     3                      16
                     4                      10
                     5                       8
                     6                       6
    .01              2                      13
                     3                       7
                     4                       5
                     5                       4
    .05              1                      19
                     2                       5
                     3                       3
                     4                       3
    .1               1                       9
                     2                       3
                     3                       2

There is another variation on this test that is useful, and in many cases preferred to the system already shown. In this variation, all the Bs do not need to outrank all the Cs. It does require that the number of B samples and the number of C samples be approximately equal. If the ratio between the size of the B sample and the C sample falls in the range of 3:4 to 4:3, the sample sizes are near enough to being equal. This test also absolutely requires randomization.

Suppose that you draw a random sample of 6 Bs and 6 Cs, and that you rank them in order of goodness. Your distribution might look like this, from best to worst:

    Best
      B  \
      B   |  Pure Bs. This is your B end count (4).
      B   |
      B  /
      C  \
      B   |
      C   |  Mixed Bs and Cs. Disregard.
      C   |
      B  /
      C  \
      C   |  Pure Cs. This is your C end count (3).
      C  /
    Worst


Now just add your B end count to your C end count, and refer to this chart for your level of significance.

    Risk     B+C End Count At Least
    .1                 6
    .05                7
    .01               10
    .001              13

    Since our total end count is 7, we can be 95% confident that B is better than C.
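A tiny helper makes the bookkeeping explicit (plain Python; the ranking is the made-up example above, best first).

    ranking = ["B", "B", "B", "B", "C", "B", "C", "C", "B", "C", "C", "C"]

    def end_counts(ranked):
        b_end = 0
        for label in ranked:            # count pure Bs at the "best" end
            if label != "B":
                break
            b_end += 1
        c_end = 0
        for label in reversed(ranked):  # count pure Cs at the "worst" end
            if label != "C":
                break
            c_end += 1
        return b_end, c_end

    b_end, c_end = end_counts(ranking)
    print(b_end, c_end, b_end + c_end)  # 4 3 7 -> at least 7, so ~95% confidence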

    The Two Sample T Test

This is a powerful test, using interval or ratio data. Its null hypothesis is that the means of two normally distributed populations are the same. The alternate hypothesis can be that they are not equal, or that the mean of A is greater/less than the mean of B.

This test requires that two fundamental assumptions be met:

The data are normally distributed. A normal distribution is also called a Gaussian distribution. Actually, the test is pretty robust, and will still give good results with pretty non-normal data.

The data are stable during the sampling period. This assumption is very important.

    You must demonstrate that you meet these assumptions if you want to cruise down the

    Gaussian superhighway. Fortunately, this is easy.

[Figure: I Chart for DATA1 -- individual values vs. observation number; center line X = 100.1, control limits 3.0SL = 117.6 and -3.0SL = 82.61]

First, let's check stability for DATA1. There are several ways to do this, but my favorite is the Individuals Control Chart. This is run by clicking STATS|CONTROL CHARTS|INDIVIDUALS. The basic thing we're doing is verifying that there are no trends or sudden, major shifts in the data (stability). This chart spreads the data out in time, and allows us to inspect it. DATA1 is just fine.
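If you want to see where those control limits come from, here is a minimal sketch of the usual individuals-chart calculation: center line at the mean, limits at plus or minus 2.66 times the average moving range. The DATA1 values are not reproduced in this paper, so the array below is an illustrative stand-in, and Minitab's exact numbers may differ slightly.

    import numpy as np

    rng = np.random.default_rng(1)
    data = rng.normal(loc=100, scale=5, size=24)   # stand-in for DATA1 (illustrative only)

    center = data.mean()
    moving_range = np.abs(np.diff(data))           # |x[i] - x[i-1]|
    mr_bar = moving_range.mean()

    # Usual individuals-chart limits: mean +/- 2.66 * average moving range
    ucl = center + 2.66 * mr_bar
    lcl = center - 2.66 * mr_bar
    print(f"X = {center:.1f}, 3.0SL = {ucl:.1f}, -3.0SL = {lcl:.1f}")

    # Stability check: flag any point outside the limits
    outside = np.where((data > ucl) | (data < lcl))[0]
    print("Points outside limits:", outside)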

DATA2 is not so fine. I see an upward trend in the first dozen data points, followed by a sudden downward shift, then a sudden upward shift at about data point 19. This data does not meet our assumption of stability.

[Figure: I Chart for DATA2 -- individual values vs. observation number; center line X = 106.6, control limits 3.0SL = 129.0 and -3.0SL = 84.29]

[Figure: I Chart for DATA3 -- individual values vs. observation number; center line X = 0.6667, control limits 3.0SL = 25.36 and -3.0SL = -24.03]

DATA3 has a discrimination problem. The only values that the data can take on are -10, -5, 0, 5, and 10. Since the data can only exist in one of five states, it will not pass a normality test. I've researched the matter, and have not been able to get an accurate statement of how much discrimination is enough. My opinion is that you should see at least 10 different values in the data, and there is some reasoning behind this opinion, which we don't have space to discuss here.

We have now concluded that DATA2 and DATA3 are not prime candidates for the Gaussian/normal model. This does not mean that we absolutely cannot apply the model, but it does mean that if we do apply it, we use the results with appropriate caution.


DATA1 has met both of our tests so far, and requires only one more test. That is for normality. This test is done by clicking STAT|BASIC STATS|NORMALITY TEST. DATA1 produces this result:

[Figure: Normal Probability Plot of DATA1, with Anderson-Darling Normality Test results: A-Squared = 0.152, P-Value = 0.953, N = 24, Average = 100.082, StDev = 5.26480]

The first test is to simply look at the data points and see if they fall close to the red line. How close is close enough? We generally use the rule that if it can be covered by a fat pencil, it is good enough. In fact, if you have fewer than 15-20 data points, this is the only test you need apply. P values are not too revealing for small samples. However, if you have many points, pay close attention to the P value. If it is .05 or more, you have no reason to believe the data is non-normal.
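Outside Minitab, the same kind of check can be sketched with scipy. Note that scipy's anderson() reports the A-squared statistic against critical values rather than a P value, so a Shapiro-Wilk test is used below for the P value; the data array is again an illustrative stand-in.

    import numpy as np
    from scipy.stats import anderson, shapiro

    rng = np.random.default_rng(7)
    data = rng.normal(loc=100, scale=5, size=24)   # stand-in for DATA1 (illustrative only)

    ad = anderson(data, dist='norm')
    print("A-squared:", round(ad.statistic, 3))
    print("Critical values (15%, 10%, 5%, 2.5%, 1%):", ad.critical_values)

    stat, p = shapiro(data)
    print("Shapiro-Wilk P value:", round(p, 3))
    # A P of .05 or more: no reason to believe the data is non-normal.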

Another way to look for normality is to do STAT|BASIC STATS|DISPLAY DESCRIPTIVE STATISTICS, and under GRAPHS, check GRAPHICAL SUMMARY. That will produce this result:

[Figure: Descriptive Statistics graphical summary for DATA1 -- histogram with fitted normal curve, boxplot, and confidence-interval plots]

    Variable: DATA1
    Anderson-Darling Normality Test:  A-Squared = 0.152   P-Value = 0.953
    N = 24    Mean = 100.082    StDev = 5.265    Variance = 27.7181
    Skewness = 0.341344    Kurtosis = 9.98E-04
    Minimum = 90.854    1st Quartile = 96.666    Median = 100.344
    3rd Quartile = 103.928    Maximum = 112.583
    95% Confidence Interval for Mu:     ( 97.859, 102.306)
    95% Confidence Interval for Sigma:  (  4.092,   7.385)
    95% Confidence Interval for Median: ( 97.551, 101.980)

Note that we get our same normality P value, .953, and that the computer draws its best estimate of a normal curve that fits the data histogram. Don't be shocked if your data looks a lot more ragged than this, but still tests normal. Small samples can look pretty Raggedy Andy, and still truly be normal.
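The confidence intervals in that summary are easy to cross-check by hand. Here is a sketch of the standard formulas (t interval for the mean, chi-square interval for sigma), using the reported mean 100.082, standard deviation 5.265, and n = 24; scipy is assumed.

    from math import sqrt
    from scipy.stats import t, chi2

    n, mean, sd = 24, 100.082, 5.265
    var = sd ** 2

    # 95% CI for the mean: mean +/- t(.975, n-1) * sd / sqrt(n)
    t_crit = t.ppf(0.975, df=n - 1)
    half = t_crit * sd / sqrt(n)
    print("CI for Mu:   ", round(mean - half, 3), round(mean + half, 3))  # ~ (97.86, 102.31)

    # 95% CI for sigma: sqrt((n-1) * s^2 / chi-square quantiles)
    lo = sqrt((n - 1) * var / chi2.ppf(0.975, df=n - 1))
    hi = sqrt((n - 1) * var / chi2.ppf(0.025, df=n - 1))
    print("CI for Sigma:", round(lo, 3), round(hi, 3))                    # ~ (4.09, 7.39)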

Using the normal distribution superhighway has its distinct advantages. The toll is that you should give at least a little attention to checking the assumptions. In practice, you need not be overly concerned about normality if your samples are reasonably large. It is not necessary for the distribution of the data to be normal. It is only necessary that the distribution of differences be normal, and, with decent size samples, that will be the case.

Let us assume that we have been collecting data on a weight loss product. Two randomly selected groups of volunteers are to be tested. Group A receives a placebo, and group B receives Factor L. Both groups are weighed at the beginning of the study, and 90 days later, and the change in weight is calculated. For each volunteer we have a number that represents weight gain. A gain of -8 would be a loss of 8 pounds. We have tested the data for normality and stability, and it is satisfactory. We select as our null hypothesis: Weight gain does not depend on whether the volunteer received the placebo or Factor L. Our alternate hypothesis is that the mean gain for the control group is greater than the mean gain for the test group.

    The test is performed by invoking STATS|BASIC STATS|2-SAMPLE T.

At this point, we must confess that we have slipped a new idea into the mix. We have developed a lot of our ideas based on the Gaussian distribution. It turns out there is another distribution that is practically identical to the Gaussian distribution for samples larger than 30. However, this other distribution is more accurate than the Gaussian for smaller sample sizes. This is the Student's T distribution. We almost always use it, since it applies everywhere the Gaussian distribution does, and some places that it does not.

Assume that my control group is in the column CONTROL, and that the test group is in the column FACTORL. You would then choose STAT|BASIC STATS|2-SAMPLE T. From the dialog window choose SAMPLES IN DIFFERENT COLUMNS, and indicate CONTROL as the first column and FACTORL as the second column. As your ALTERNATIVE, indicate GREATER THAN. This may seem a little backwards, but we are hoping the weight gains in the first column are greater than the weight gains in the second column. Minitab is not specific about telling you how to do this, so you must know that it assumes the form COLUMN 1 [>, <, or not equal] COLUMN 2.


                  N      Mean     StDev   SE Mean
    CONTROL      24      1.01      3.19      0.65
    FACTORL      24     -2.61      4.36      0.89

    95% CI for mu CONTROL - mu FACTORL: ( 1.40, 5.85)
    T-Test mu CONTROL = mu FACTORL (vs >): T = 3.29 P = 0.0010 DF = 42

[Figure: Boxplots of CONTROL and FACTORL (means are indicated by solid circles)]

The interpretation of this is that our control group of 24 volunteers gained an average of 1.01 pounds, and that the standard deviation of the group's gain was 3.19 pounds. Our test volunteers gained an average of -2.61 pounds (a 2.61 pound loss), and the standard deviation of the group's gain was 4.36 pounds. If we assert that the gain of the control group is greater than the gain of the test group, there is a .0010 chance that we will be wrong. We therefore reject the null hypothesis, and accept the alternative hypothesis.

    The box plot gives a nice visual to demonstrate what the statistics tell us.

    It is by no means necessary to have equal sample sizes.

We did not check the ASSUME EQUAL VARIANCES box, because we have not taken the step of demonstrating that the standard deviations (and hence the variances) of the two groups are approximately equal. If you run TEST OF EQUAL VARIANCES on your data, you can earn the right to check this box. However, doing this essentially relieves the computer of some of its calculation duties, at some expense of your time, with negligible difference in results. If you are feeling compassionate toward your overworked computer, you may want to take this route. I suggest just leaving the box unchecked.

The default alternative hypothesis is NOT EQUAL. We chose to do a one-sided test, GREATER THAN, because it always gives the test more power. If we had chosen NOT EQUAL, our P value would have been .0020. In this case, we would still make the same decision, but, still, we have doubled our P value. Frequently, this will be of great importance. If the test results are interesting only if they come out in one particular direction, then use the one-sided test. That is most of the time.
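For reference, the same one-sided, unequal-variance test can be run with scipy. This is a sketch: the arrays are placeholders for the CONTROL and FACTORL columns, and the alternative keyword requires a reasonably recent scipy.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(42)
    control = rng.normal(loc=1.0, scale=3.2, size=24)    # stand-in for CONTROL (illustrative)
    factorl = rng.normal(loc=-2.6, scale=4.4, size=24)   # stand-in for FACTORL (illustrative)

    # equal_var=False leaves the "assume equal variances" box unchecked (Welch's test);
    # alternative='greater' is the one-sided test: mean(control) > mean(factorl).
    t_stat, p = ttest_ind(control, factorl, equal_var=False, alternative='greater')
    print(f"T = {t_stat:.2f}, P = {p:.4f}")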


    ANOVA

ANOVA stands for Analysis of Variance. It is very much like the 2 Sample T Test, but instead of having just two groups, you can have multiple groups. This is extremely handy. The null hypothesis of ANOVA is that all the groups have the same mean. Also, it assumes that all the groups have roughly the same standard deviation. Before running ANOVA, you should do Test of Equal Variances to ensure this. If Levene's test comes out with a P of .05 or more, you're generally good to go.

Giving ANOVA such a short discussion shouldn't be interpreted as indicating that it is less useful. It is an extremely useful tool, but beyond the scope of a short paper. If you understand the T Test, you should be able to move into simple ANOVA without much difficulty.
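As a sketch of what that looks like outside Minitab (the three groups below are illustrative placeholders; scipy's levene and f_oneway are assumed to be available):

    import numpy as np
    from scipy.stats import levene, f_oneway

    rng = np.random.default_rng(3)
    group_a = rng.normal(100, 5, size=20)   # illustrative data only
    group_b = rng.normal(103, 5, size=20)
    group_c = rng.normal( 99, 5, size=20)

    # Check the equal-variance assumption first (Levene's test).
    stat, p_var = levene(group_a, group_b, group_c)
    print(f"Levene P = {p_var:.3f}")        # .05 or more: generally good to go

    # One-way ANOVA: null hypothesis is that all group means are equal.
    f_stat, p_anova = f_oneway(group_a, group_b, group_c)
    print(f"F = {f_stat:.2f}, P = {p_anova:.4f}")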

    Power and Sample Size

DO NOT BEGIN AN EXPERIMENT THAT YOU ARE GOING TO ANALYZE WITH A T TEST OR A 2^K FACTORIAL WITHOUT UNDERSTANDING THIS SECTION!

Risk, sample size, and the size of change you are trying to detect are three factors that are eternally at odds. It is rather like the old engineering maxim, "Good, fast, cheap. Pick any two." You can run a test for a small change, using a small sample size, but your risk of error will be high. You can run a test with low risk, trying to detect a small change, but your sample size will be large.

A powerful test is one that is not highly influenced by random factors. The mathematical definition of power is 1 - β. It is the probability that you will detect a difference, if, in fact, it exists.

If you properly use this function of Minitab, you will be able to predict the approximate chances of success of your experiment. You would be amazed how many tests are run that have too few samples to have any reasonable chance of success. Conversely, you would also be amazed at how many tests are much more expensive than necessary, because they have far too many samples.

If someone wants you to run a test that you can demonstrate has a power of .5 or less, I suggest that you give them a quarter, and tell them to make their decision based on whether it comes up heads or tails. If your test is no better than a quarter, don't bother running it. Get on to something that does have a good chance of success.

Let's suppose that you know that normal for Factor K in the general population includes values from 90 to 140. Ben assures me that the medical standard for normal is 95% of the population. That's convenient, because 95% of the population occurs from 2 standard deviations below the mean to 2 standard deviations above the mean. We know, then, that 140 to 90 is four standard deviations, and can quickly deduce that one standard deviation is about 12.5.


Now suppose that we have a new treatment that we think will beneficially increase Factor K by 15 points. Go to STAT|POWER AND SAMPLE SIZE. Choose 2-SAMPLE T, then CALCULATE SAMPLE SIZE FOR EACH POWER VALUE. In POWER VALUES, enter .95 .9 .87. In DIFFERENCE enter 15. In SIGMA, enter 12.5. Click OPTIONS, and choose GREATER THAN as your alternative hypothesis. Minitab will return this:

    2-Sample t Test

    Testing mean 1 = mean 2 (versus >)

    Calculating power for mean 1 = mean 2 + 15

    Alpha = 0.05 Sigma = 12.5

    Sample   Target   Actual
      Size    Power    Power
        16   0.9500   0.9527
        13   0.9000   0.9077
        12   0.8700   0.8854

In this case, if we want a power of at least .95, we need two samples of 16 each. For a power of .9, we need only 13 in each sample, and if we are willing to settle for only .87, we need 12 in each sample.

Similarly, we can calculate our chances of success if sample size is dictated by budget or the ever-innumerate pointy-haired boss.
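The same calculation can be sketched with statsmodels (assuming it is installed). The effect size is the difference divided by sigma, 15/12.5 = 1.2, and the answer should land close to Minitab's 16 per group, though the exact decimals may differ slightly.

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Difference of 15 with sigma 12.5 -> standardized effect size d = 1.2
    n_per_group = analysis.solve_power(effect_size=15 / 12.5,
                                       alpha=0.05,
                                       power=0.95,
                                       ratio=1.0,
                                       alternative='larger')
    print(round(n_per_group, 1))   # roughly 16 samples in each group

    # Or go the other way: power for a sample size forced on you by budget
    power = analysis.solve_power(effect_size=1.2, nobs1=12, alpha=0.05,
                                 alternative='larger')
    print(round(power, 2))         # roughly .88-.89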

2^K Factorial Experiments

    These are wonderful.

When you do a T Test, or a single-factor ANOVA, you are changing one variable, and studying the effect on your test subjects. This is a "one factor at a time" or OFAT experiment.

OFAT experiments are fine, but you have to realize that any factor that you don't account for shows up as noise, which makes your distributions fatter, and makes your effect harder to see. That is, cholesterol levels depend on a number of factors, such as diet, level of exercise, and heredity. In an OFAT experiment, you randomize to make sure that these variables do indeed show up as noise. Would it not be better to make them part of the experiment? Then they can be made into signal instead of noise.

A 2^K experiment allows you to do this. It also allows you to detect interactions between the variables. The lovely thing is that you can do this for four or five variables, and still not have to take many more observations.


For example, you can test two treatments simultaneously, and account for any differences due to gender, and you can test for all the interactions, all for the price of a T Test.

To set up your test, choose two to five variables you want to test. More than this can get a little messy. Choose a high and a low value for each variable. It is better if these are interval values, but they can be categorical/nominal. For example, if I want to test the effect of ascorbic acid and aspirin on acidity, I might choose zero as my low dose for both, and 500 mg as my high dose for both. I might want to test subject age as my other variable. I could pick 20-25 as the low value for my age variable and 55-60 as the high value.

    My output will be some very scientifically arrived at acidity scale, which extends from 0

    to 100.

For a simple demonstration, let's do this with a single replicate, and no center points. We'll add these embellishments in class.

In Minitab, go to STAT|DOE|CREATE FACTORIAL DESIGN. Choose 2 LEVEL FACTORIAL, with 3 as your number of variables. Under DESIGNS, choose FULL FACTORIAL, with 0 center points, 1 replicate, and 1 block. Click OK, then go to FACTORS, and in the NAME column, put ascorbic acid next to A, aspirin next to B, and age next to C. Your low value will automatically be coded as a -1, and your high value will automatically be coded as a 1. Yes, you can change this, but wait until you understand a little better.

Click OK, and OK again, and Minitab will design your experiment for you. It will automatically randomize your run order for you. (Actually, with a single replicate, randomization buys you nothing, so we would have been justified in not randomizing this time. However, we will shortly drop a variable, effectively creating a second replicate, and, at that point, randomization starts to have some use.) Your design will look different from mine, because your random number generator spat out a different sequence than mine did---well, we hope so, or they are not so random. Anyway, mine looks like this:

    StdOrder  RunOrder  CenterPt  Blocks  ascorbic acid  aspirin  age  acidity
        2         1         1        1          1          -1     -1
        3         2         1        1         -1           1     -1
        8         3         1        1          1           1      1
        1         4         1        1         -1          -1     -1
        4         5         1        1          1           1     -1
        6         6         1        1          1          -1      1
        7         7         1        1         -1           1      1
        5         8         1        1         -1          -1      1

This instructs me to find a young subject, give him a high dose of ascorbic acid, a low dose of aspirin, and note the resulting acidity. I then continue on down the list. One missing data point is not necessarily the kiss of death when you have multiple replicates (sets of data), but you do want to try very hard to get complete sets.
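A full factorial design like the one above is easy to generate yourself if you ever need to. Here is a minimal sketch in Python (standard library only) that produces the eight +/-1 runs in a randomized order:

    import random
    from itertools import product

    factors = ["ascorbic acid", "aspirin", "age"]

    # All 2^3 = 8 combinations of low (-1) and high (+1) settings.
    runs = [dict(zip(factors, levels))
            for levels in product([-1, 1], repeat=len(factors))]

    random.shuffle(runs)                      # randomize the run order
    for run_order, settings in enumerate(runs, start=1):
        print(run_order, settings)
    # Run the experiment in this order and record the measured acidity for each row.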

In my table, I created a made-up set of results, and Minitab obligingly spits out the following interesting output:

    Fractional Factorial Fit

    Estimated Effects and Coefficients for acidity (coded units)

    Term                    Effect      Coef
    Constant                         43.7500
    aspirin                57.5000   28.7500
    ascorbic               27.5000   13.7500
    age                    10.0000    5.0000
    aspirin*ascorbic        7.5000    3.7500
    aspirin*age             0.0000    0.0000
    ascorbic*age           -0.0000   -0.0000
    aspirin*ascorbic*age    0.0000    0.0000

    Analysis of Variance for acidity (coded units)

    Source              DF   Seq SS   Adj SS   Adj MS   F   P
    Main Effects         3  8325.00  8325.00  2775.00   *   *
    2-Way Interactions   3   112.50   112.50    37.50   *   *
    3-Way Interactions   1     0.00     0.00     0.00   *   *
    Residual Error       0     0.00     0.00     0.00
    Total                7  8437.50

The interpretation of this output is that going from the low state of aspirin to the high state accounts for 57.5 units of the observed change in acidity. Similarly, ascorbic acid is responsible for 27.5 units of change. Age, and an interaction between aspirin and ascorbic acid, account for lesser amounts of change, and the model predicts that these last two are not statistically significant. We shall see more about this as we add replicates.

[Figure: Pareto Chart of the Effects (response is acidity, Alpha = .10); bars for A: aspirin, B: ascorbic, C: age, and the interactions AB, BC, AC, ABC, on a scale of 0 to 50]

    Since age is not indicated as a significant variable, we can gain more information about

    the important variables by eliminating this unimportant one.

By going to STAT|DOE|ANALYZE FACTORIAL DESIGN|TERMS, we can eliminate all terms except aspirin and ascorbic acid. Since we have only 2 main terms, only 4 runs are required per replicate. But note that we still have 8 observations. The computer will recognize this, and turn our 4 extra observations into a replicate, giving us two replicates. The output now becomes quite a bit stronger.

Minitab does do a quick shuffle on us, without telling us---with a single replicate, it reports the magnitude of the effect. When we get two replicates, it reports a T value. The output looks like this:

    Estimated Effects and Coefficients for acidity (coded units)

    Term        Effect    Coef   StDev Coef       T       P
    Constant             43.75        2.795   15.65   0.000
    aspirin      57.50   28.75        2.795   10.29   0.000
    ascorbic     27.50   13.75        2.795    4.92   0.004

    Analysis of Variance for acidity (coded units)

    Source           DF   Seq SS   Adj SS   Adj MS       F       P
    Main Effects      2   8125.0   8125.0  4062.50   65.00   0.000
    Residual Error    5    312.5    312.5    62.50
      Lack of Fit     1    112.5    112.5   112.50    2.25   0.208
      Pure Error      4    200.0    200.0    50.00
    Total             7   8437.5

[Figure: Pareto Chart of the Standardized Effects (response is acidity, Alpha = .10); bars for aspirin and ascorbic, on a scale of 0 to 10]

We now get F and P values, and note that we still have about the same estimates for the constant (average acidity with no treatment), aspirin, and ascorbic acid. We also see that our model accounts for 8125/8437.5 of the variation, with 312.5/8437.5 of the variation being attributed to noise (variables we did not account for).

We have less than one chance per thousand (.001) of being wrong if we assert that aspirin and ascorbic acid increase acidity. We also know the magnitude of change that each variable causes. That's a pretty substantial model, for only 8 observations.
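For readers without Minitab, essentially the same analysis can be sketched with statsmodels: code the factors as -1/+1, fit a linear model, and read off coefficients, T values, and P values. The paper does not list the made-up acidity responses, but because the full model fits the single replicate with zero residual error, they can be reconstructed exactly from the effects table above; those reconstructed values are used below.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Coded 2^3 design with acidity responses reconstructed from the effects
    # table above (constant 43.75, effects 57.5, 27.5, 10, 7.5).
    df = pd.DataFrame({
        "aspirin":  [-1,  1, -1,  1, -1,  1, -1,  1],
        "ascorbic": [-1, -1,  1,  1, -1, -1,  1,  1],
        "age":      [-1, -1, -1, -1,  1,  1,  1,  1],
        "acidity":  [ 0, 50, 20, 85, 10, 60, 30, 95],
    })

    # Reduced model keeping only aspirin and ascorbic acid, as in the text.
    model = ols("acidity ~ aspirin + ascorbic", data=df).fit()
    print(model.params)        # intercept ~43.75, aspirin ~28.75, ascorbic ~13.75
    print(model.tvalues)       # roughly 15.65, 10.29, 4.92, matching the output above
    print(sm.stats.anova_lm(model))

    # Note: with -1/+1 coding, Minitab's "effect" is twice the regression
    # coefficient (the change from the low setting to the high setting).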

Even if we cannot reduce the model by dropping one variable, a simple 8 observation experiment for three variables can be very revealing, especially if we are in the screening stages.

The output of the test assumes a linear model. If we want to test the validity of this assumption, we can add center points. Three per replicate is pretty normal.

Using Power and Sample Size, we find that if we want to be .95 sure of catching a 15 point shift in a process that normally runs with a standard deviation of 15, and run no more than a .05 alpha risk, we need 7 replicates, or 56 total observations. For 56 observations, we get full evaluation of three variables and all their interactions. A T Test to detect a single variable on this level requires 54 observations. Two more observations give us a lot more information. A four variable experiment with these objectives requires only 64 observations. Actually, the factorial experiment is much more likely to succeed than the T Test, besides being much more informative.

Why should this be so? Suppose we are testing blood cholesterol levels, and our variables are Medicine X, gender, and exercise level. If we test Medicine X vs. placebo on the general population, we will have relatively broad distributions, since we randomly select males, females, couch potatoes, and athletes to ensure that we represent the whole population. In a 2^K experiment, our distributions will have relatively smaller standard deviations, since we will be considering more restricted groups, made up of males who exercise, male couch potatoes, female athletes, and female couch potatoes. These are more select groups. We have taken some of the variation out, and made it into variables which we test. Skinnier distributions make it easier to detect smaller changes, so our chances of success are better than they would be with a T Test.

To demonstrate the effect of replicates, the same data was copied into an experiment with the same variables, and 7 replicates instead of 1. Random numbers with a mean of 0 and a standard deviation of 10 were added to the data, to simulate the effects of unaccounted-for variables and measurement system error. Our results are:


    Fractional Factorial Fit

    Estimated Effects and Coefficients for acid2 (coded units)

    Term Effect Coef StDev Coef T P

    Constant 42.2642 1.337 31.62 0.000

    aspirin 59.4971 29.7485 1.337 22.26 0.000

    ascorbic 28.5346 14.2673 1.337 10.67 0.000

    age 14.8242 7.4121 1.337 5.55 0.000

    aspirin*ascorbic 14.2761 7.1380 1.337 5.34 0.000

    aspirin*age -0.0245 -0.0123 1.337 -0.01 0.993

    ascorbic*age 0.5717 0.2859 1.337 0.21 0.832

    aspirin*ascorbic*age 1.5459 0.7730 1.337 0.58 0.566

    Analysis of Variance for acid2 (coded units)

    Source DF Seq SS Adj SS Adj MS F P

    Main Effects 3 64034.4 64034.4 21344.8 213.34 0.000

    2-Way Interactions 3 2857.9 2857.9 952.6 9.52 0.000

    3-Way Interactions 1 33.5 33.5 33.5 0.33 0.566

    Residual Error 48 4802.5 4802.5 100.1

    Pure Error 48 4802.5 4802.5 100.1

    Total 55 71728.2

    Unusual Observations for acid2

    Obs acid2 Fit StDev Fit Residual St Resid

    11 34.105 12.708 3.781 21.398 2.31R

    44 103.573 84.959 3.781 18.614 2.01R

    55 47.981 26.582 3.781 21.399 2.31R

    R denotes an observation with a large standardized residual

[Figure: Pareto Chart of the Standardized Effects (response is acid2, Alpha = .10); bars for A: aspirin, B: ascorbic, C: age, and their interactions, on a scale of 0 to 20]

[Figure: residual diagnostics for acid2 -- Histogram of Residuals; I Chart of Residuals with X = 0.000, 3.0SL = 30.76, -3.0SL = -30.76; Residuals vs. Fits; Normal Plot of Residuals]

[Figure: Interaction Plot (data means) for acid2 -- mean acidity vs. aspirin at the low and high levels of ascorbic acid]


With 7 replicates, we can see more clearly, and all the variables have statistical significance. We can also see that there is a significant interaction between aspirin and ascorbic acid. This is different from the two producing a simple additive result. The combination of aspirin and ascorbic acid being in their high state produces a stronger result than just adding the two results together.

We have also performed an analysis of our residuals. This is something you should always do when you have replicates. We want to see residuals that are normally distributed, time stable, and fairly equally spaced about zero in the residuals vs. fits graph. These residuals aren't perfect, but they are quite adequate.

We have nicely recovered our factors, too, even in the face of a fair amount of deliberate statistical noise. The recovered factors correspond well with the ones used to generate the outcomes.

    This class of experimental designs has some tremendous benefits:

1. It is vastly more efficient than single factor designs. For the price of a T test, which provides information on a single variable, we get five variables and all their possible interactions. For six or more variables, it becomes more efficient to use advanced versions of this design, which are beyond the scope of this paper.

2. Single factor designs will discover interactions only with the greatest of labor and successful intuition. 2^K designs discover and characterize them with great ease.

3. 2^K designs are more likely to discover effects than simple T tests are. Factors that cloud a T test with noise can be made into input variables in this type of test, and that removes their ability to conceal effects.

    Promontory Management Group, Inc.

www.pmg.cc    801 710 5645

This document is provided free of charge, for the use of the person who requested it. It may not be reproduced, distributed, or incorporated in a course of study without written permission.